Crawler Source – Yext Help

The Crawler is a method of extracting data from websites to bring into Yext. Data is scraped from specified web pages and then ingested via a connector. Crawlers can be configured to run on a one-time or ongoing basis.

This doc covers crawler configuration settings, how to use a crawler as a connector source, and system limits.

Crawler Configuration

Duplicate URLs

If a URL is encountered more than once in a single crawl, only the first instance will be crawled.

Schedule

Option	Behavior
Once	The crawler runs a single time. This is the default.
Daily	A new crawl is initiated exactly one day after the previous crawl finishes.
Weekly	A new crawl is initiated exactly one week after the previous crawl finishes.

Inactivity reset. Crawlers set to Daily or Weekly will automatically revert to Once after 14 days of inactivity. A crawler is considered inactive if:

It is not linked to a connector
It is linked to a manually run connector that has not been run in the past 14 days
It is linked to a manually run connector whose configuration has not been viewed in the past 14 days

Crawlers reset to Once remain in the platform and can still be viewed and run manually. You can re-add a daily or weekly schedule, but it will reset again if the crawler becomes inactive.

Supported File Types

Option	Behavior
All File Types	All supported file types (HTML and PDF) will be crawled if encountered. Automatically includes any file types added in the future.
Select File Types	Only the selected file types will be crawled.

Note that the crawler can still detect and spider to URLs found in unselected file types without actually crawling them. For example, if you select PDF only but your PDFs are linked from HTML pages, the crawler will spider through the HTML to find the PDFs — but will only crawl the PDFs themselves.
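The difference between spidering through a page and crawling it can be sketched as follows. This is only an illustration, not the crawler's actual implementation: the `PAGES` map and the `crawl` function are hypothetical, and the link graph is invented for the example.

```python
# Hypothetical site: each URL maps to (file type, URLs it links to).
PAGES = {
    "www.yext.com/resources": ("html", ["www.yext.com/guide.pdf", "www.yext.com/about"]),
    "www.yext.com/about": ("html", []),
    "www.yext.com/guide.pdf": ("pdf", []),
}

def crawl(start_url, selected_types, pages=PAGES):
    """Follow links through every page, but record only selected file types."""
    visited, crawled, stack = set(), [], [start_url]
    while stack:
        url = stack.pop()
        if url in visited or url not in pages:
            continue
        visited.add(url)
        file_type, links = pages[url]
        if file_type in selected_types:
            crawled.append(url)  # counted as crawled
        stack.extend(links)      # links are followed either way
    return sorted(crawled)
```

With `selected_types={"pdf"}`, the HTML pages are only used to discover links, so the result contains the PDF alone.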

Rate Limit

The maximum number of concurrent tasks the crawler can execute on a site at one time.

	Value
Default	100
Minimum	1
Maximum	15,000

Set this based on the maximum number of concurrent requests your site can handle without impacting performance. If the site receives too many requests, it may return a 429 (Too Many Requests) error, which appears in the Status column on the Crawl Details page. If this happens, lower the rate limit. If your site can handle higher throughput, increase it to speed up the crawl.
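To make the idea of a concurrency cap concrete, here is a minimal sketch that uses an asyncio semaphore to keep no more than `rate_limit` simulated requests in flight at once. The `fetch` coroutine, the example.com URLs, and the function names are hypothetical stand-ins; this is not how the Yext crawler is built.

```python
import asyncio

async def crawl_with_limit(urls, rate_limit):
    """Fetch every URL while keeping at most `rate_limit` requests in flight."""
    semaphore = asyncio.Semaphore(rate_limit)
    in_flight = 0
    peak = 0

    async def fetch(url):
        # Hypothetical stand-in for a real HTTP request.
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # simulate network latency
            in_flight -= 1
        return url

    results = await asyncio.gather(*(fetch(u) for u in urls))
    return results, peak

def run_demo(rate_limit=5, page_count=20):
    """Run the simulated crawl and return (fetched URLs, peak concurrency)."""
    urls = [f"https://www.example.com/page/{i}" for i in range(page_count)]
    return asyncio.run(crawl_with_limit(urls, rate_limit))
```

The peak concurrency returned by `run_demo` never exceeds the configured limit, which is the property the Rate Limit setting is meant to give your site.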

Blacklisted URLs

Specify URLs to exclude from crawling, even if they fall within the chosen crawl strategy. You can provide exact URLs or use regex notation to match URL patterns.

To learn more about regex notation, see Mozilla's developer cheat sheet.
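As an illustration of how exact URLs and regex patterns might be combined, the sketch below checks a URL against a small exclusion list using Python's `re` module. The patterns are made-up examples, and the choice to require a full-URL match is an assumption; the platform's matching rules may differ.

```python
import re

# Hypothetical exclusion rules: one exact URL and two regex patterns.
BLACKLIST = [
    r"https://www\.yext\.com/careers",
    r"https://www\.yext\.com/blog/2019/.*",
    r".*\.(png|jpg)$",
]

def is_blacklisted(url, patterns=BLACKLIST):
    """Return True if the whole URL matches any exclusion pattern."""
    # Assumption: a pattern must match the full URL, not just part of it.
    return any(re.fullmatch(p, url) for p in patterns)
```

Note that dots in an exact URL are escaped (`\.`) so they match a literal dot rather than any character.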

Source Type

The crawler supports two source types: Sitemap or Domain.

Sitemap Source Type

Requires a sitemap URL (.xml format). You can also choose whether to check the lastmod tag on URLs in the sitemap.
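For reference, a sitemap lists each page in a `<loc>` tag with an optional `<lastmod>` date. The sketch below parses a small inlined sitemap with Python's standard library and filters by date. The sample content is hypothetical, and the `changed_since` rule (skip pages not modified since the last crawl, keep pages with no lastmod) is an assumption about what a lastmod check could do, not a description of the crawler's behavior.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, inlined so the example runs without a network call.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.yext.com/blog</loc>
    <lastmod>2023-05-01</lastmod>
  </url>
  <url>
    <loc>https://www.yext.com/platform</loc>
  </url>
</urlset>"""

def read_sitemap(xml_text):
    """Return (url, lastmod) pairs; lastmod is None when the tag is absent."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        entries.append((loc.strip(), lastmod.strip() if lastmod else None))
    return entries

def changed_since(entries, last_crawl):
    """Keep URLs whose lastmod is missing or later than `last_crawl` (YYYY-MM-DD)."""
    # Assumption: a page without lastmod is always treated as changed.
    return [loc for loc, lastmod in entries if lastmod is None or lastmod > last_crawl]
```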

Domain Source Type

Requires a crawl strategy, one or more start URLs, and optionally a sub-page URL structure.

Pages or Domains to Crawl

The start URL is the first URL crawled and the starting point for all subsequent spidering. You can configure more than one start URL. Start URLs can be as broad or specific as needed — all of the following are valid:

yext.com
blog.yext.com
https://www.yext.com
https://www.yext.com/blog/2023/05/why-tech-leaders-are-embracing-custom-built-dxp

Including https:// is optional — the crawler resolves the HTTPS protocol automatically.

Crawl Strategy

Determines which pages are crawled from the start URL. The crawler detects pages via URLs in href attributes within the HTML (e.g., <a href="www.yext.com">), then spiders to each and repeats, up to 100,000 URLs (see System Limits below).

Strategy	Behavior
All Pages	Crawls all pages on the same root domain and subdomain as the start URL, including pages at a higher subdirectory level than the start URL.
Sub-Pages	Crawls only pages at a lower subdirectory level than the start URL, on the same subdomain and root domain.
Specific Pages	Only the URLs listed as start URLs will be crawled. No spidering occurs.

All Pages example (start URL: www.yext.com/blog):

Page URL	Crawled	Reason
`www.yext.com`	Yes	Subdomain and root domain match
`www.yext.com/blog/2023/04/use-cases-for-ai`	Yes	Subdomain and root domain match
`www.yext.com/platform`	Yes	Subdomain and root domain match
`help.yext.com`	No	Subdomain does not match

Sub-Pages example (start URL: www.yext.com/blog):

Page URL	Crawled	Reason
`www.yext.com`	No	Not a sub-page of the start URL
`www.yext.com/blog/2023/04/use-cases-for-ai`	Yes	Sub-page of the start URL
`www.yext.com/platform`	No	Not a sub-page of the start URL
`help.yext.com`	No	Subdomain does not match
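The scoping rules in the two tables above can be approximated with a short function. This is a sketch under stated assumptions: the strategy names (`all_pages`, `sub_pages`, `specific_pages`) and the `in_scope` helper are hypothetical, and hosts are compared as exact strings, so `yext.com` and `www.yext.com` would count as different hosts here.

```python
from urllib.parse import urlsplit

def _parts(url):
    """Split a URL into (host, path), tolerating a missing scheme."""
    if "://" not in url:
        url = "https://" + url
    split = urlsplit(url)
    return split.netloc.lower(), split.path.rstrip("/")

def in_scope(url, start_url, strategy):
    """Decide whether `url` falls within the strategy for one start URL."""
    host, path = _parts(url)
    start_host, start_path = _parts(start_url)
    if strategy == "specific_pages":
        return (host, path) == (start_host, start_path)
    if host != start_host:
        return False  # subdomain or root domain does not match
    if strategy == "all_pages":
        return True
    if strategy == "sub_pages":
        # The start URL itself, or anything beneath its path.
        return path == start_path or path.startswith(start_path + "/")
    raise ValueError(f"unknown strategy: {strategy}")
```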

Sub-Page URL Structure

Allows the crawler to spider to sub-pages with a different URL structure than the start URL. Use this to crawl sub-pages across multiple subdirectories on one domain using a single crawler.

For example, to capture sub-pages under both www.yext.com/blog and www.yext.com/faq but not www.yext.com/products, set the start URL to www.yext.com/blog with the Sub-Pages strategy, and add www.yext.com/faq/* as a sub-page URL structure.

Page URL	Crawled	Reason
`www.yext.com`	No	Not a sub-page of the start URL
`www.yext.com/blog/2023/04/use-cases-for-ai`	Yes	Sub-page of the start URL
`www.yext.com/faq/posts/2023`	Yes	Matches sub-page URL structure
`help.yext.com`	No	Subdomain does not match
`www.yext.com/products/posts/content`	No	Neither a sub-page of the start URL nor matching the sub-page URL structure
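The table above can be reproduced with a small check that accepts a URL if it is a sub-page of the start URL or matches one of the sub-page URL structures. The helper names are hypothetical, and treating `*` as a shell-style wildcard (via Python's `fnmatch`) is an assumption about how the pattern is interpreted.

```python
from fnmatch import fnmatchcase

def strip_scheme(url):
    """Drop any scheme prefix and trailing slash so URLs compare consistently."""
    return url.split("://", 1)[-1].rstrip("/")

def is_crawled(url, start_url, sub_page_patterns=()):
    """Sub-Pages strategy plus optional sub-page URL structures."""
    candidate = strip_scheme(url)
    start = strip_scheme(start_url)
    if candidate == start or candidate.startswith(start + "/"):
        return True  # the start URL or one of its sub-pages
    # Assumption: "*" in a structure matches any run of characters.
    return any(fnmatchcase(candidate, strip_scheme(p)) for p in sub_page_patterns)
```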

Query Parameter Settings

Designate whether query parameters should be ignored when differentiating between crawled URLs. This prevents the crawler from treating the same page with different query strings as distinct URLs.

Option	Behavior
All	All query parameters are ignored
Specific Parameters	Only the specified parameters are ignored
None	No query parameters are ignored. Overrides any Specific Parameters settings.

All query parameters ignored (start URL: www.yext.com/blog):

Page URL	Crawled	Reason	Resulting Crawled URL
`www.yext.com/blog?utm_source=google`	Yes	Matches the start URL	`www.yext.com/blog`
`www.yext.com/blog?page=11&language=en`	No	Resulting URL duplicates the prior crawled URL	N/A

Specific parameter language ignored:

Page URL	Crawled	Reason	Resulting Crawled URL
`www.yext.com/blog?utm_source=google`	Yes	Parameter not ignored	`www.yext.com/blog?utm_source=google`
`www.yext.com/blog?language=en&page=11`	Yes	With `language` ignored, URL is not a duplicate	`www.yext.com/blog?page=11`
`www.yext.com/blog?language=en&utm_source=google`	No	With `language` ignored, resulting URL duplicates a prior crawled URL	N/A

No query parameters ignored:

Page URL	Crawled	Reason	Resulting Crawled URL
`www.yext.com/blog?utm_source=google`	Yes	Seen as a distinct URL	`www.yext.com/blog?utm_source=google`
`www.yext.com/blog?language=en&page=11`	Yes	Seen as a distinct URL	`www.yext.com/blog?language=en&page=11`
`www.yext.com/blog?language=en&utm_source=google`	Yes	Seen as a distinct URL	`www.yext.com/blog?language=en&utm_source=google`
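The three tables above follow from a simple normalize-then-deduplicate step, sketched below with Python's `urllib.parse`. The function names and the `ignore` values (`"all"`, `"specific"`, `"none"`) are hypothetical labels for the three options, and this is an illustration of the behavior rather than the crawler's own code.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def normalize(url, ignore="none", params=()):
    """Drop ignored query parameters so equivalent URLs compare equal.

    ignore: "all", "specific" (drop only names in `params`), or "none".
    """
    has_scheme = "://" in url
    split = urlsplit(url if has_scheme else "https://" + url)
    pairs = parse_qsl(split.query, keep_blank_values=True)
    if ignore == "all":
        pairs = []
    elif ignore == "specific":
        pairs = [(k, v) for k, v in pairs if k not in params]
    elif ignore != "none":
        raise ValueError(f"unknown option: {ignore}")
    result = urlunsplit((split.scheme, split.netloc, split.path, urlencode(pairs), ""))
    return result if has_scheme else result[len("https://"):]

def dedupe(urls, ignore="none", params=()):
    """Return the resulting crawled URLs, keeping only the first of each duplicate."""
    seen, crawled = set(), []
    for url in urls:
        key = normalize(url, ignore, params)
        if key not in seen:
            seen.add(key)
            crawled.append(key)
    return crawled
```

For example, `dedupe(urls, ignore="specific", params=("language",))` on the three URLs from the second table keeps the first two and drops the third as a duplicate.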

Max Depth (Domain Source Only)

The number of levels past the start URL that the crawler will spider to.

	Value
Default	10
Minimum	0
Maximum	100

For example, if the start URL is www.yext.com/blog (depth = 0) and Max Depth is set to 1, the crawler will spider to pages directly linked from www.yext.com/blog (e.g., www.yext.com/blog/2023), but will not continue to pages linked from those results.

If a URL appears at multiple depths during a crawl, the system uses the depth of the first instance encountered.
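Depth limiting and the "first instance wins" rule can be illustrated with a breadth-first walk over a link graph. The `LINKS` map below is invented for the example, and breadth-first ordering is an assumption; the order in which the crawler actually visits pages is not specified here.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "www.yext.com/blog": ["www.yext.com/blog/2023", "www.yext.com/blog/2022"],
    "www.yext.com/blog/2023": ["www.yext.com/blog/2023/04", "www.yext.com/blog/2022"],
    "www.yext.com/blog/2022": [],
    "www.yext.com/blog/2023/04": ["www.yext.com/blog"],
}

def crawl_depths(start_url, max_depth, links=LINKS):
    """Breadth-first walk recording the depth at which each URL is first seen."""
    depths = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if depths[url] >= max_depth:
            continue  # do not spider past the configured depth
        for linked in links.get(url, []):
            if linked not in depths:  # the first instance sets the depth
                depths[linked] = depths[url] + 1
                queue.append(linked)
    return depths
```

With `max_depth=1`, only the start URL and the pages it links to directly are reached; `www.yext.com/blog/2023/04` appears only once Max Depth is raised to 2.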

Crawler Status and Performance

Page	What It Shows
Crawler Overview Page	Status bar with details for the most recent crawl. The Pages tab lists all unique pages crawled across all crawls; details for a given page reflect its most recent crawl.
Crawl Details Page	Details for a single crawl, listing all pages crawled in that run.

Using the Crawler as a Connector Source

The connector pulls data from the most recently completed crawl of the selected crawler.

File Types

Select the types of files to bring in via the connector: HTML, PDF, or both.

This setting does not have to match the file type setting of the crawler itself, but the connector is limited to only the file types that were actually crawled.

Example: If a crawler is configured to crawl both HTML and PDF files, a connector using that crawler can be set to ingest only HTML files. However, if the crawler was set to crawl only HTML files, a connector configured to ingest only PDFs would return no data.

URLs

Option	Behavior
All URLs Crawled	Ingests all URLs present in the most recently completed crawl.
Specific URLs or URL Patterns	Specify a comma-separated list of exact URL paths or wildcard patterns. Only matching URLs present in the most recent crawl will be ingested.

For example, if https://faqs.yext.com is specified:

https://faqs.yext.com/blogs/1 will be included
https://pages.yext.com/blogs/1 will not be included

Page Type

Option	Behavior
Detail Page	Each page corresponds to the data for a single entity.
List Page	Multiple entities are contained on a single URL. Only supported for HTML file types (see PDF Support below). The entity container is specified by a CSS or XPath expression pointing to the outer container that includes all the information to extract for each entity.

PDF Support

If the connector's File Type setting includes PDFs (i.e., PDF Only or PDF and HTML), the Page Type must be set to Detail Page.

When data is extracted from a PDF, each PDF is treated as a single entity. The PDF's metadata is read into fields, and the entire body of the PDF is ingested as unstructured content.

HTML-specific selectors (CSS or XPath) are not compatible with PDFs. If a connector is configured to support both HTML and PDF file types, these selectors can still be added — but the corresponding fields will be blank for any PDFs.

Selectors

Selector	HTML	PDF
CSS Path	Yes	No
XPath	Yes	No
Page ID	Yes	Yes
Page URL	Yes	Yes
Page Title (from `<title>` tag)	Yes	Yes
Cleaned Body Content	Yes	Yes
Author	No	Yes
Created Date	No	Yes

If both HTML and PDF file types are selected, all HTML-specific selectors will be available but will return blank values for PDFs. The Author and Created Date selectors will not be available in this combined mode.
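The selector rules above can be summarized as a lookup. The sketch below transcribes the table into a dictionary and derives which selectors are offered for a given file type setting; the function names are hypothetical and the code only restates the rules described in this section.

```python
# Selector support by file type, transcribed from the table above.
SELECTOR_SUPPORT = {
    "CSS Path": {"html"},
    "XPath": {"html"},
    "Page ID": {"html", "pdf"},
    "Page URL": {"html", "pdf"},
    "Page Title": {"html", "pdf"},
    "Cleaned Body Content": {"html", "pdf"},
    "Author": {"pdf"},
    "Created Date": {"pdf"},
}

def available_selectors(file_types):
    """List the selectors offered for a connector's file type setting."""
    file_types = set(file_types)
    # In combined mode the HTML selector set applies, so PDF-only selectors drop out.
    basis = "html" if "html" in file_types else "pdf"
    return [name for name, types in SELECTOR_SUPPORT.items() if basis in types]

def blank_for_pdfs(file_types):
    """Selectors that are offered but return blank values for PDF pages."""
    if "pdf" not in set(file_types):
        return []
    return [n for n in available_selectors(file_types) if "pdf" not in SELECTOR_SUPPORT[n]]
```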

System Limits

Resource	Limit
PDF file size	50 MB
PDF body content ingested as body field	1 MB of text. If exceeded, content is truncated to the first 1 MB.
Max Depth	0 to 100 levels past the start URL
Rate Limit	1 to 15,000 concurrent tasks
URLs spidered per crawl	100,000 maximum. The crawler stops once this limit is reached.
Domain	768 characters maximum
Crawler Name	255 characters maximum
URL Pattern	1 to 2,000 characters
Page Load Wait Time	10 seconds. The crawler waits this long to load the page and execute JavaScript before extracting HTML.
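If you generate crawler settings programmatically, a pre-check against the limits above can catch out-of-range values early. The sketch below is a hypothetical helper: `CrawlerConfig` and `validate` are invented names, not part of any Yext API, and only a few of the limits are covered.

```python
from dataclasses import dataclass

@dataclass
class CrawlerConfig:
    """Hypothetical container for a few crawler settings."""
    name: str
    domain: str
    rate_limit: int = 100  # documented default
    max_depth: int = 10    # documented default

def validate(config):
    """Return a list of limit violations (empty if the config is within limits)."""
    errors = []
    if len(config.name) > 255:
        errors.append("Crawler Name exceeds 255 characters")
    if len(config.domain) > 768:
        errors.append("Domain exceeds 768 characters")
    if not 1 <= config.rate_limit <= 15_000:
        errors.append("Rate Limit must be between 1 and 15,000")
    if not 0 <= config.max_depth <= 100:
        errors.append("Max Depth must be between 0 and 100")
    return errors
```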