Configure crawler settings

Crawler settings define the scope of the source. This includes the domains the crawler can crawl, the URLs it must avoid, the maximum URL depth you want the crawler to reach, and more.

Note

To index data from a single website, the best practice is to use a single crawler. To index data from articles, pages, PDFs, and other document types that might store metadata differently, we recommend that you apply multiple document extractors to the crawler.

Use the following settings to configure the crawler:

Setting	Description
ALLOWED DOMAINS	Domains you want the crawler to crawl and index. Usually, enter the top-level domain that you want the crawler to crawl. For example, if you want the crawler to index only Sitecore documentation, enter doc.sitecore.com as the allowed domain instead of www.sitecore.com. Defining allowed domains prevents the crawler from indexing any third-party content. For example, assume your original content has blogs that link to external social media sites. If you define an allowed domain, the crawler indexes the blog page, ignores the links to the social media sites, and moves to the next URL. You can add more than one allowed domain. Default: None. This means that the crawler crawls all domains that it encounters.
MAX DEPTH	Maximum number of links the crawler follows in a single URL. For example, you have a max depth of three. The crawler starts at www.sitecore.com, depth level one, then goes to www.sitecore.com/products, depth level two, and then to www.sitecore.com/products/content-cloud, depth level 3. The crawler does not open hyperlinks on the last page because it has reached the maximum depth of three. Defaults: 0 for the web crawler source with a sitemap or sitemap index trigger. 2 for all other sources. Limit: `5`
MAX URLS	Maximum number of URLs to crawl, in total. We recommend entering a large number, like 100000 for MAX URLS. Always set to be greater than your estimate for URLs. This prevent a crawl being stopped before completion. Defaults: 10000 for the API crawler. 1000 for all other crawlers.
EXCLUSION PATTERNS	Glob or regular expression that defines the URLs you don't want the crawler to index. You can create multiple exclusion patterns. Default: None
PARALLELISM (WORKERS)	Number of threads, or workers, that concurrently crawl and index content. More workers index content faster but also use more resources. Default: 1 Limit: `10`
DELAY (MS)	Time, in milliseconds, the crawler needs to wait before accessing the next URL to index. You can use this setting to regulate the crawler's requests, preventing it from potentially overloading the resources that host your original content. Note To define a delay, set PARALLELISM (WORKERS) to 1. You cannot set a delay when there are multiple workers because each worker operates independently. Default: 0, or no delay. Limit: `5000ms`
TIMEOUT	Time, in milliseconds, the crawler waits to get a response from each URL or document. When the timeout period expires, the crawler does not index that URL or document but moves on to the next one. Defaults: 60,000 for a web crawler with a sitemap or sitemap index trigger. 10000 for all other crawlers. Limit: `180,000ms` (3 minutes)
HEADERS	User agent that the crawler uses when crawling your website. For security, set a user agent for the crawler and whitelist this user agent. This prevents other bots from crawling your site. Enter values for key and value. For example, enter the key as user-agent and the value as `sitecorebot`. Default: Empty.
ENABLE NAVIGATION COOKIES DURING CRAWLING	Control whether the crawler accepts navigation cookies set by websites. These cookies track a user's (here, the crawler) path and record visited URLs. One reason you might want to disable these cookies is so that the crawler does not get misled into re-indexing previously visited URLs. Another reason is that the crawler might not index content across multiple locales since cookies can limit the crawler to the first locale it encounters. Tip If you index content in multiple locales, we recommend disabling navigational cookie acceptance in the crawler. On the other hand, you might want to enable navigation cookies for websites that need them for subsequent access, especially websites that need authentication. Default: Enabled. The crawler accepts navigational cookies.
Render Javascript	Whether you want the crawler to render JavaScript on the page. If you turn on Render JavaScript, the crawler waits for JavaScript to render and indexes that content in addition to the page source. If you do not turn on Render JavaScript, the crawler only indexes content in the page source. If you turn on Render Javascript, you must define some additional settings. Default: `2000ms` Limit: `60,000ms` (1 minute)

If you have suggestions for improving this article, let us know!