The Crawler is a Data Connector that helps scrape web pages for their HTML content. Once a crawl has successfully run on a set of web pages, the Add Data flow can help convert that raw HTML content into entities in the Yext platform. This article covers how to create a Crawler.
Before setting up a Crawler for your website, make sure the Yext Crawler is whitelisted so that it can access your web pages. We ask that you whitelist both the Crawler's user agent and its IP addresses.
User Agent
The Yext Crawler uses the following user agent:
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/87.0.4280.88 YextBot/Java Safari/537.36
IPs
The Yext Crawler uses the following IP addresses:
- 54.204.19.87
- 50.19.160.200
- 34.198.218.97
- 54.221.171.225
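If your whitelist is enforced in application code rather than in a firewall or CDN rule, the check could look something like the sketch below. It is only an illustration: the function and constant names are hypothetical, and matching on the `YextBot` token (instead of the full user agent string) is an assumption made so the check does not depend on the exact browser version in the string.

```python
import ipaddress

# IP addresses copied from the list above. The names used here are hypothetical.
YEXT_CRAWLER_IPS = frozenset(
    ipaddress.ip_address(ip)
    for ip in ("54.204.19.87", "50.19.160.200", "34.198.218.97", "54.221.171.225")
)

# Assumption: the "YextBot" token is enough to recognize the user agent.
YEXT_USER_AGENT_TOKEN = "YextBot"


def is_yext_crawler(user_agent: str, remote_ip: str) -> bool:
    """Return True only when both the user agent and the source IP match."""
    try:
        ip = ipaddress.ip_address(remote_ip)
    except ValueError:
        # Malformed address: treat as not whitelisted.
        return False
    return YEXT_USER_AGENT_TOKEN in user_agent and ip in YEXT_CRAWLER_IPS
```

Requiring both conditions mirrors the request above to whitelist the user agent and the IP addresses together, since a user agent string alone is easy for other clients to imitate.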
To create a Crawler:
- Click Content in the navigation bar and click Configuration.
- Click Crawlers.
- Click on the + New Crawler button.
- Enter a name for your Crawler.
- Click Weekly and select how often you would like the Crawler to run: Once, Daily, or Weekly.
- Click Sub Pages and select your Crawl Strategy.
- Your Crawl Strategy is where you indicate whether you want to crawl all pages, sub-pages, or specific pages.
- Select the File Types that the crawler should crawl.
- Enter the pages or domains you would like to crawl. To add additional pages or domains, click on the + Add Another link.
- Note that when you enter a domain, the Crawler will crawl that domain and any pages on the same domain that it can reach by spidering (following links).
- (Optional) Add the domains you would like to exclude from the crawl.
- (Optional) Specify the Rate Limit or Max Depth for the crawler.
- Once you save your Crawler, you will return to the Crawlers page. To view the details of the Crawler you just created, click on the View Details button.
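To make the spidering and Max Depth settings in the steps above more concrete, here is a minimal sketch of a same-domain, depth-limited crawl. It is not the Yext Crawler's actual implementation. The `spider` function, the in-memory `links` map (standing in for real page fetches), and the convention that the starting page sits at depth 0 are all assumptions made for illustration.

```python
from collections import deque
from urllib.parse import urlparse


def spider(start_url, links, max_depth):
    """Breadth-first walk from start_url, staying on its host and within max_depth.

    `links` is a hypothetical map of page URL -> list of linked URLs,
    used here instead of real HTTP requests and HTML parsing.
    """
    host = urlparse(start_url).hostname
    seen = {start_url}
    queue = deque([(start_url, 0)])
    crawled = []
    while queue:
        url, depth = queue.popleft()
        crawled.append(url)
        if depth == max_depth:
            continue  # do not follow links past the depth limit
        for target in links.get(url, []):
            # Only follow links that stay on the starting host.
            if target not in seen and urlparse(target).hostname == host:
                seen.add(target)
                queue.append((target, depth + 1))
    return crawled


site = {
    "https://www.example.com/": [
        "https://www.example.com/about",
        "https://other.example.org/",
    ],
    "https://www.example.com/about": ["https://www.example.com/team"],
}
print(spider("https://www.example.com/", site, max_depth=1))
# prints ['https://www.example.com/', 'https://www.example.com/about']
```

In this example the link to `other.example.org` is skipped because it is on a different host, and the `/team` page is skipped because it is two links away from the starting page.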