The Crawler is a method of extracting data from websites to bring into Yext. Data is scraped from specified web pages and then ingested via a connector. Crawlers can be configured to run on a one-time or ongoing basis.
This doc covers crawler configuration settings, how to use a crawler as a connector source, and system limits.
Crawler Configuration
Duplicate URLs
If a URL is encountered more than once in a single crawl, only the first instance will be crawled.
Schedule
| Option | Behavior |
|---|---|
| Once | The crawler runs a single time. This is the default. |
| Daily | A new crawl is initiated exactly one day after the previous crawl finishes. |
| Weekly | A new crawl is initiated exactly one week after the previous crawl finishes. |
Inactivity reset. Crawlers set to Daily or Weekly will automatically revert to Once after 14 days of inactivity. A crawler is considered inactive if:
- It is not linked to a connector
- It is linked to a manually-run connector that has not been run in the past 14 days
- It is linked to a manually-run connector whose configuration has not been viewed in the past 14 days
Crawlers reset to Once remain in the platform and can still be viewed and run manually. You can re-add a daily or weekly schedule, but it will reset again if the crawler becomes inactive.
Supported File Types
| Option | Behavior |
|---|---|
| All File Types | All supported file types (HTML and PDF) will be crawled if encountered. Automatically includes any file types added in the future. |
| Select File Types | Only the selected file types will be crawled. |
Note that the crawler can still detect and spider to URLs found in unselected file types without actually crawling them. For example, if you select PDF only but your PDFs are linked from HTML pages, the crawler will spider through the HTML to find the PDFs — but will only crawl the PDFs themselves.
Rate Limit
The maximum number of concurrent tasks the crawler can execute on a site at one time.
| Setting | Value |
|---|---|
| Default | 100 |
| Minimum | 1 |
| Maximum | 15,000 |
Set this based on the maximum number of concurrent requests your site can handle without impacting performance. If too many requests are received, the site will return a 429 error, which will appear in the Status column on the Crawl Details page. If this happens, lower the rate limit. If your site can handle a higher throughput, increase it to speed up the crawl.
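To illustrate what a cap on concurrent tasks means in practice, here is a minimal Python sketch that bounds in-flight requests with a semaphore. It is not the crawler's actual implementation: `crawl_all` and the `fetch` coroutine are hypothetical names introduced for this example, and only the default (100) and the allowed range (1 to 15,000) come from this page.

```python
import asyncio

async def crawl_all(urls, fetch, rate_limit=100):
    """Run fetch(url) for every URL with at most rate_limit tasks in flight.

    fetch is a hypothetical coroutine supplied by the caller; the default and
    bounds of rate_limit mirror the values documented above.
    """
    if not 1 <= rate_limit <= 15_000:
        raise ValueError("rate_limit must be between 1 and 15,000")

    semaphore = asyncio.Semaphore(rate_limit)

    async def bounded(url):
        # Wait for a free slot before issuing the request.
        async with semaphore:
            return await fetch(url)

    # gather preserves the order of the input URLs in its result list.
    return await asyncio.gather(*(bounded(url) for url in urls))
```

With a lower `rate_limit`, fewer requests reach the site at once, which is the adjustment to make when pages start returning 429.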
Blacklisted URLs
Specify URLs to exclude from crawling, even if they fall within the chosen crawl strategy. You can provide exact URLs or use regex notation to match URL patterns.
To learn more about regex notation, see Mozilla's developer cheat sheet.
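As a rough illustration of how exact URLs and regex patterns can both exclude pages, the Python sketch below checks a URL against a list of entries. The function name `is_excluded` and the sample entries are hypothetical, and the choice to require a full match of the whole URL is an assumption; this page does not say how the crawler anchors its patterns.

```python
import re

def is_excluded(url, exclusions):
    """Return True if url equals an entry or fully matches it as a regex.

    Assumption: a pattern must match the entire URL (re.fullmatch).
    """
    for entry in exclusions:
        if url == entry:
            return True
        try:
            if re.fullmatch(entry, url):
                return True
        except re.error:
            # Entry is not a valid regex; it can only match exactly.
            continue
    return False

# Hypothetical entries: one exact URL and one pattern for all 2022 blog posts.
exclusions = [
    "www.yext.com/careers",
    r"www\.yext\.com/blog/2022/.*",
]
```

Here `is_excluded("www.yext.com/blog/2022/01/post", exclusions)` is true, while a 2023 blog post under the same path is not excluded.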
Source Type
The crawler supports two source types: Sitemap or Domain.
Sitemap Source Type
Requires a sitemap URL (.xml format). You can also choose whether to check the lastmod tag on URLs in the sitemap.
Domain Source Type
Requires a crawl strategy, one or more start URLs, and optionally a sub-page URL structure.
Pages or Domains to Crawl
The start URL is the first URL crawled and the starting point for all subsequent spidering. You can configure more than one start URL. Start URLs can be as broad or specific as needed — all of the following are valid:
- yext.com
- blog.yext.com
- https://www.yext.com
- https://www.yext.com/blog/2023/05/why-tech-leaders-are-embracing-custom-built-dxp
Including https:// is optional — the crawler resolves the HTTPS protocol automatically.
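For readers who pre-process start URLs in their own tooling, a small Python sketch of the same idea follows. The helper name `with_https` is hypothetical, and simply prefixing the scheme is an assumption about what "resolving the protocol" amounts to.

```python
from urllib.parse import urlsplit

def with_https(start_url):
    """Return start_url unchanged if it has a scheme, else prefix https://."""
    if "://" in start_url:
        return start_url
    return "https://" + start_url

# Once a scheme is present, the host can be parsed reliably.
host = urlsplit(with_https("blog.yext.com")).hostname
```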
Crawl Strategy
Determines which pages are crawled from the start URL. The crawler detects pages through URLs in href attributes within the HTML (e.g., <a href="www.yext.com">), spiders to each one, and repeats the process, up to 100,000 URLs (see System Limits below).
| Strategy | Behavior |
|---|---|
| All Pages | Crawls all pages on the same root domain and subdomain as the start URL, including pages at a higher subdirectory level than the start URL. |
| Sub-Pages | Crawls only pages at a lower subdirectory level than the start URL, on the same subdomain and root domain. |
| Specific Pages | Only the URLs listed as start URLs will be crawled. No spidering occurs. |
All Pages example (start URL: www.yext.com/blog):
| Page URL | Crawled | Reason |
|---|---|---|
| www.yext.com | Yes | Subdomain and root domain match |
| www.yext.com/blog/2023/04/use-cases-for-ai | Yes | Subdomain and root domain match |
| www.yext.com/platform | Yes | Subdomain and root domain match |
| help.yext.com | No | Subdomain does not match |
Sub-Pages example (start URL: www.yext.com/blog):
| Page URL | Crawled | Reason |
|---|---|---|
| www.yext.com | No | Not a sub-page of the start URL |
| www.yext.com/blog/2023/04/use-cases-for-ai | Yes | Sub-page of the start URL |
| www.yext.com/platform | No | Not a sub-page of the start URL |
| help.yext.com | No | Subdomain does not match |
Sub-Page URL Structure
Allows the crawler to spider to sub-pages with a different URL structure than the start URL. Use this to crawl sub-pages across multiple subdirectories on one domain using a single crawler.
For example, to capture sub-pages under both www.yext.com/blog and www.yext.com/faq but not www.yext.com/products, set the start URL to www.yext.com/blog with the Sub-Pages strategy, and add www.yext.com/faq/* as a sub-page URL structure.
| Page URL | Crawled | Reason |
|---|---|---|
| www.yext.com | No | Not a sub-page of the start URL |
| www.yext.com/blog/2023/04/use-cases-for-ai | Yes | Sub-page of the start URL |
| www.yext.com/faq/posts/2023 | Yes | Matches sub-page URL structure |
| help.yext.com | No | Subdomain does not match |
| www.yext.com/products/posts/content | No | Neither a sub-page of the start URL nor matching the sub-page URL structure |
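The scoping rules for the three crawl strategies and the sub-page URL structure can be sketched as a single decision function. This is an approximation written for illustration, not the crawler's code: the strategy identifiers (`"all_pages"`, `"sub_pages"`, `"specific_pages"`), the function names, and the use of shell-style wildcard matching for the `*` in a sub-page URL structure are all assumptions. It handles a single start URL for brevity.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

def _parts(url):
    """Split a URL into (host, path), tolerating a missing scheme."""
    split = urlsplit(url if "://" in url else "https://" + url)
    return (split.hostname or "").lower(), split.path.rstrip("/")

def in_scope(url, start_url, strategy, sub_page_patterns=()):
    """Decide whether url would be crawled for one start URL (a sketch)."""
    host, path = _parts(url)
    start_host, start_path = _parts(start_url)

    if strategy == "specific_pages":
        # No spidering: only the start URL itself qualifies.
        return (host, path) == (start_host, start_path)
    if host != start_host:
        return False  # subdomain does not match
    if strategy == "all_pages":
        return True
    if strategy == "sub_pages":
        if path == start_path or path.startswith(start_path + "/"):
            return True
        # Assumption: '*' in a sub-page URL structure matches any characters.
        candidate = host + path
        return any(fnmatchcase(candidate, p) for p in sub_page_patterns)
    raise ValueError(f"unknown strategy: {strategy}")
```

Called with the start URL `www.yext.com/blog`, the `"sub_pages"` strategy, and the pattern `www.yext.com/faq/*`, the function reproduces the Crawled column of the table above.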
Query Parameter Settings
Designate whether query parameters should be ignored when differentiating between crawled URLs. This prevents the crawler from treating the same page with different query strings as distinct URLs.
| Option | Behavior |
|---|---|
| All | All query parameters are ignored |
| Specific Parameters | Only the specified parameters are ignored |
| None | No query parameters are ignored. Overrides any Specific Parameters settings. |
All query parameters ignored (start URL: www.yext.com/blog):
| Page URL | Crawled | Reason | Resulting Crawled URL |
|---|---|---|---|
| www.yext.com/blog?utm_source=google | Yes | Matches the start URL | www.yext.com/blog |
| www.yext.com/blog?page=11&language=en | No | Resulting URL duplicates the prior crawled URL | N/A |
Specific parameter language ignored:
| Page URL | Crawled | Reason | Resulting Crawled URL |
|---|---|---|---|
| www.yext.com/blog?utm_source=google | Yes | Parameter not ignored | www.yext.com/blog?utm_source=google |
| www.yext.com/blog?language=en&page=11 | Yes | With language ignored, URL is not a duplicate | www.yext.com/blog?page=11 |
| www.yext.com/blog?language=en&utm_source=google | No | With language ignored, resulting URL duplicates a prior crawled URL | N/A |
No query parameters ignored:
| Page URL | Crawled | Reason | Resulting Crawled URL |
|---|---|---|---|
| www.yext.com/blog?utm_source=google | Yes | Seen as a distinct URL | www.yext.com/blog?utm_source=google |
| www.yext.com/blog?language=en&page=11 | Yes | Seen as a distinct URL | www.yext.com/blog?language=en&page=11 |
| www.yext.com/blog?language=en&utm_source=google | Yes | Seen as a distinct URL | www.yext.com/blog?language=en&utm_source=google |
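The three query parameter options can be modeled as a normalization step applied before the duplicate check. The Python sketch below is one way to express that, assuming the crawler compares normalized URL strings. `dedup_key`, `crawl_order`, and the mode names `"all"`, `"specific"`, and `"none"` are hypothetical labels chosen for this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def dedup_key(url, ignore="none", ignored_params=()):
    """Return the URL form used to detect duplicates under a given setting."""
    if ignore == "none":
        return url  # every distinct query string is a distinct URL
    split = urlsplit(url)
    pairs = parse_qsl(split.query, keep_blank_values=True)
    if ignore == "all":
        pairs = []
    elif ignore == "specific":
        pairs = [(k, v) for k, v in pairs if k not in ignored_params]
    else:
        raise ValueError(f"unknown mode: {ignore}")
    return urlunsplit(split._replace(query=urlencode(pairs)))

def crawl_order(urls, ignore="none", ignored_params=()):
    """Keep the first URL for each dedup key, in encounter order."""
    seen, crawled = set(), []
    for url in urls:
        key = dedup_key(url, ignore, ignored_params)
        if key not in seen:
            seen.add(key)
            crawled.append(key)
    return crawled
```

Passing the three URLs from the tables above with `ignore="specific"` and `ignored_params=("language",)` yields `www.yext.com/blog?utm_source=google` and `www.yext.com/blog?page=11`, matching the Resulting Crawled URL column.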
Max Depth (Domain Source Only)
The number of levels past the start URL that the crawler will spider to.
| Setting | Value |
|---|---|
| Default | 10 |
| Minimum | 0 |
| Maximum | 100 |
For example, if the start URL is www.yext.com/blog (depth = 0) and Max Depth is set to 1, the crawler will spider to pages directly linked from www.yext.com/blog (e.g., www.yext.com/blog/2023), but will not continue to pages linked from those results.
If a URL appears at multiple depths during a crawl, the system uses the depth of the first instance encountered.
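The depth rule can be pictured as a breadth-first traversal that records each URL's depth the first time it is seen and stops following links at the limit. The sketch below assumes breadth-first order, which this page does not confirm. `spider` and the `get_links` callback are hypothetical names, and `max_urls` mirrors the 100,000-URL cap listed under System Limits.

```python
from collections import deque

def spider(start_url, get_links, max_depth=10, max_urls=100_000):
    """Return {url: depth} for every URL reached from start_url.

    get_links(url) is a caller-supplied function returning the URLs linked
    from a page. Breadth-first order is an assumption of this sketch.
    """
    depths = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        depth = depths[url]
        if depth >= max_depth:
            continue  # pages at the limit are crawled, but their links are not followed
        for link in get_links(url):
            if link in depths:
                continue  # keep the depth of the first instance encountered
            if len(depths) >= max_urls:
                return depths  # stop once the URL cap is reached
            depths[link] = depth + 1
            queue.append(link)
    return depths
```

With `max_depth=1`, only the start URL and the pages it links to directly appear in the result.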
Crawler Status and Performance
| Page | What It Shows |
|---|---|
| Crawler Overview Page | Status bar with details for the most recent crawl. The Pages tab lists all unique pages crawled across all crawls; details for a given page reflect its most recent crawl. |
| Crawl Details Page | Details for a single crawl, listing all pages crawled in that run. |
Using the Crawler as a Connector Source
The connector pulls data from the most recently completed crawl of the selected crawler.
File Types
Select the types of files to bring in via the connector: HTML, PDF, or both.
This setting does not have to match the file type setting of the crawler itself, but the connector is limited to only the file types that were actually crawled.
Example: If a crawler is configured to crawl both HTML and PDF files, a connector using that crawler can be set to ingest only HTML files. However, if the crawler was set to crawl only HTML files, a connector configured to ingest only PDFs would return no data.
URLs
| Option | Behavior | Details |
|---|---|---|
| All URLs Crawled | Ingests all URLs present in the most recently completed crawl. | |
| Specific URLs or URL Patterns | Specify a comma-separated list of exact URL paths or wildcard patterns. Only matching URLs present in the most recent crawl will be ingested. | For example, if |
Page Type
| Option | Behavior |
|---|---|
| Detail Page | Each page corresponds to the data for a single entity. |
| List Page | Multiple entities are contained on a single URL. Only supported for HTML file types (see PDF Support below). The entity container is specified by a CSS or XPath expression pointing to the outer container that includes all the information to extract for each entity. |
PDF Support
If the connector's File Type setting includes PDFs (i.e., PDF Only or PDF and HTML), the Page Type must be set to Detail Page.
When data is extracted from a PDF, each PDF is treated as a single entity. The data is read as field metadata, and the entire body of the PDF is ingested as unstructured content.
HTML-specific selectors (CSS or XPath) are not compatible with PDFs. If a connector is configured to support both HTML and PDF file types, these selectors can still be added — but the corresponding fields will be blank for any PDFs.
Selectors
| Selector | HTML | PDF |
|---|---|---|
| CSS Path | Yes | No |
| XPath | Yes | No |
| Page ID | Yes | Yes |
| Page URL | Yes | Yes |
| Page Title (from <title> tag) | Yes | Yes |
| Cleaned Body Content | Yes | Yes |
| Author | No | Yes |
| Created Date | No | Yes |
If both HTML and PDF file types are selected, all HTML-specific selectors will be available but will return blank values for PDFs. The Author and Created Date selectors will not be available in this combined mode.
System Limits
| Resource | Limit |
|---|---|
| PDF file size | 50 MB |
| PDF body content ingested as body field | 1 MB of text. If exceeded, content is truncated to the first 1 MB. |
| Max Depth | 0 to 100 levels past the start URL |
| Rate Limit | 1 to 15,000 concurrent tasks |
| URLs spidered per crawl | 100,000 maximum. The crawler stops once this limit is reached. |
| Domain | 768 characters maximum |
| Crawler Name | 255 characters maximum |
| URL Pattern | 1 to 2,000 characters |
| Page Load Wait Time | 10 seconds. The crawler waits this long to load the page and execute JavaScript before extracting HTML. |
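Several of these limits can be checked before a crawler is configured. The Python sketch below validates a few settings against the table and mimics the 1 MB truncation of PDF body text. The function names are hypothetical. Treating "1 MB" as 1,048,576 bytes of UTF-8 text is an assumption, as is counting the other limits in characters rather than bytes.

```python
MAX_BODY_BYTES = 1024 * 1024  # assumption: "1 MB" means 1,048,576 UTF-8 bytes

def validate_settings(name, domain, rate_limit=100, max_depth=10):
    """Return a list of limit violations (empty when all settings fit)."""
    problems = []
    if len(name) > 255:
        problems.append("Crawler Name exceeds 255 characters")
    if len(domain) > 768:
        problems.append("Domain exceeds 768 characters")
    if not 1 <= rate_limit <= 15_000:
        problems.append("Rate Limit must be between 1 and 15,000")
    if not 0 <= max_depth <= 100:
        problems.append("Max Depth must be between 0 and 100")
    return problems

def truncate_body(text, limit=MAX_BODY_BYTES):
    """Keep at most the first `limit` bytes of text, measured in UTF-8."""
    encoded = text.encode("utf-8")
    if len(encoded) <= limit:
        return text
    # errors="ignore" drops a multi-byte character cut in half at the boundary.
    return encoded[:limit].decode("utf-8", errors="ignore")
```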