The Crawler is a data connector that scrapes webpages for HTML content. Once a crawl has successfully run on a set of webpages, the crawler can help convert that raw HTML content into entities in the Yext platform. This article covers how to create a crawler.
Before setting up a crawler for your website, you need to ensure that the crawler is allowed to access your web pages. We ask that you whitelist both our crawler's user agent and our IP addresses.
User Agent
The Yext Crawler uses the following user agent:
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/87.0.4280.88 YextBot/Java Safari/537.36
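If your server filters traffic by user agent, you can recognize the Yext Crawler by the YextBot token in the string above. A minimal sketch of such a check, assuming your framework exposes the request's User-Agent header as a string (the function name here is illustrative, not part of any Yext API):

```python
# Hypothetical server-side check: does a request come from the Yext Crawler?
# The user-agent string below is the one documented in this article.

YEXT_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/87.0.4280.88 "
    "YextBot/Java Safari/537.36"
)

def is_yext_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent header contains the YextBot token."""
    return "YextBot" in user_agent

print(is_yext_crawler(YEXT_UA))                              # True
print(is_yext_crawler("Mozilla/5.0 (compatible; OtherBot)")) # False
```

Matching on the YextBot token rather than the full string keeps the check stable if other parts of the user agent (such as the Chrome version) change.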
IPs
The Yext Crawler uses the following IP addresses:
- 54.204.19.87
- 50.19.160.200
- 34.198.218.97
- 54.221.171.225
For accounts in the EU, the crawler uses the following IP addresses instead:
- 35.240.80.184
- 35.241.216.126
- 35.195.140.58
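If you whitelist programmatically rather than in a firewall rule, the published addresses above can be checked against a request's source IP. A minimal sketch using Python's standard `ipaddress` module (the function name is illustrative, not part of any Yext API):

```python
import ipaddress

# The addresses below are the crawler IPs published in this article.
YEXT_CRAWLER_IPS_US = ["54.204.19.87", "50.19.160.200", "34.198.218.97", "54.221.171.225"]
YEXT_CRAWLER_IPS_EU = ["35.240.80.184", "35.241.216.126", "35.195.140.58"]

# Parse once into ip_address objects so comparisons are exact, not string-based.
_ALLOWED = {ipaddress.ip_address(ip) for ip in YEXT_CRAWLER_IPS_US + YEXT_CRAWLER_IPS_EU}

def is_allowed_crawler_ip(source_ip: str) -> bool:
    """Return True if source_ip is one of the published Yext crawler IPs."""
    try:
        return ipaddress.ip_address(source_ip) in _ALLOWED
    except ValueError:  # not a valid IP address at all
        return False

print(is_allowed_crawler_ip("54.204.19.87"))  # True
print(is_allowed_crawler_ip("8.8.8.8"))       # False
```

In practice this kind of allowlisting is usually done at the firewall or WAF level; the sketch only illustrates the same logic in application code.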
To create a Crawler:
- Click Content in the navigation bar, then click Configuration.
- Click Crawlers.
- Click on the + New Crawler button.
- Enter a name for your Crawler.
- Click Weekly and select how often you would like the crawler to run: Once, Daily, or Weekly.
- Click Sub Pages and select your Crawl Strategy.
- The Crawl Strategy indicates whether you want to crawl all pages, sub-pages only, or specific pages.
- Select the File Types that the crawler should crawl.
- Enter the pages or domains you would like to crawl. To add additional pages or domains, click the + Add Another link.
- Note that when you enter a domain, any pages that can be spidered to within that domain will also be crawled.
- (Optional) Add the domains you would like to exclude from the crawl.
- (Optional) Specify the Rate Limit or Max Depth for the crawler.
- Once you save your crawler, you will return to the Crawlers page. To view the details of the crawl you just created, click the View Details button.