Configuring crawler settings

Crawler settings define the scope of the source. This includes the domains the crawler can crawl, the URLs it must avoid, the maximum URL depth you want the crawler to reach, and more.


Use the following settings to configure the crawler:




Allowed domains

Domains you want the crawler to crawl and index. Usually, enter the top-level domain that you want the crawler to crawl.

For example, if you want the crawler to index only Sitecore documentation, enter the documentation site's domain as the allowed domain instead of the broader top-level domain.

Defining allowed domains prevents the crawler from indexing any third-party content. For example, assume your original content has blogs that link to external social media sites. If you define an allowed domain, the crawler indexes the blog page, ignores the links to the social media sites, and moves to the next URL.

You can add more than one allowed domain.

Default: None. This means that the crawler crawls all domains that it encounters.
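The effect of this setting can be sketched in Python. The function, the in-memory check, and the domains below are illustrative, not part of the product:

```python
from urllib.parse import urlparse

def is_allowed(url: str, allowed_domains: set[str]) -> bool:
    """Return True if the URL's host is in the allowed list.

    An empty allowed list means every domain is allowed,
    matching the crawler's default behavior.
    """
    if not allowed_domains:
        return True
    host = urlparse(url).netloc.lower()
    return host in allowed_domains

# A blog page on the allowed domain is crawled; an external
# social media link found on that page is skipped.
allowed = {"doc.example.com"}
print(is_allowed("https://doc.example.com/blog/post-1", allowed))  # True
print(is_allowed("https://social.example.net/share", allowed))     # False
```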


Max depth

Maximum number of consecutive links the crawler follows from a single starting URL.

For example, suppose the max depth is three. The crawler starts at a page (depth level one), follows a link to a second page (depth level two), and then follows a link to a third page (depth level three). The crawler does not open hyperlinks on the last page because it has reached the maximum depth of three.


Default:

  • 0 for the web crawler source with a sitemap or sitemap index trigger

  • 2 for all other sources
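A minimal sketch of depth-limited crawling, using an in-memory dictionary in place of real pages (all names and URLs here are illustrative):

```python
from collections import deque

def crawl(start: str, links: dict[str, list[str]], max_depth: int) -> list[str]:
    """Breadth-first crawl that stops following links at max_depth.

    `links` stands in for the web: it maps each URL to the
    hyperlinks found on that page.
    """
    visited, order = {start}, []
    queue = deque([(start, 1)])  # the start URL is depth level one
    while queue:
        url, depth = queue.popleft()
        order.append(url)  # index this page
        if depth >= max_depth:
            continue  # max depth reached: do not open this page's links
        for nxt in links.get(url, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, depth + 1))
    return order

site = {"/a": ["/b"], "/b": ["/c"], "/c": ["/d"]}
print(crawl("/a", site, max_depth=3))  # ['/a', '/b', '/c'] -- /d is beyond depth three
```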


Max URLs

Maximum number of URLs to crawl, in total.


Default:

  • 10000 for the API crawler

  • 1000 for all other crawlers


Exclusion patterns

Glob or regular expression that defines the URLs you don't want the crawler to index. You can create multiple exclusion patterns.

Default: None
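A sketch of how glob and regex exclusion patterns can be evaluated, using Python's standard fnmatch and re modules. The helper and the patterns are illustrative; the product's matching rules may differ in detail:

```python
import re
from fnmatch import fnmatch

def is_excluded(url: str, globs: list[str], regexes: list[str]) -> bool:
    """Return True if the URL matches any exclusion pattern."""
    if any(fnmatch(url, g) for g in globs):
        return True
    return any(re.search(rx, url) for rx in regexes)

globs = ["*/internal/*"]   # glob: skip anything under an /internal/ path
regexes = [r"\.pdf$"]      # regex: skip PDF files
print(is_excluded("https://doc.example.com/internal/notes", globs, regexes))  # True
print(is_excluded("https://doc.example.com/guide.pdf", globs, regexes))       # True
print(is_excluded("https://doc.example.com/guide", globs, regexes))           # False
```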


Parallelism (workers)

Number of threads, or workers, that concurrently crawl and index content.

More workers index content faster but also use more resources.

Default: 5
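Conceptually, this setting behaves like the size of a worker pool. A sketch with Python's ThreadPoolExecutor, where index() is an illustrative stand-in for the real fetch-and-index step:

```python
from concurrent.futures import ThreadPoolExecutor

def index(url: str) -> str:
    """Stand-in for fetching and indexing one URL."""
    return f"indexed {url}"

urls = [f"/page-{i}" for i in range(10)]

# Five workers (the default) pull URLs from the list concurrently.
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(index, urls))

print(len(results))  # 10
```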


Delay

Time, in milliseconds, the crawler waits before accessing the next URL to index.

You can use this setting to regulate the crawler's request rate and prevent it from overloading the servers that host your original content.


To define a delay, set PARALLELISM (WORKERS) to 1. You cannot set a delay when there are multiple workers because each worker operates independently.

Default: 0, or no delay
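The single-worker pacing described above can be sketched as follows (crawl_with_delay and the fetch callback are illustrative names, not product code):

```python
import time

def crawl_with_delay(urls: list[str], fetch, delay_ms: int) -> None:
    """Call fetch() on each URL, waiting delay_ms before each next URL.

    Pacing like this only makes sense with a single worker; with
    several workers, each one would pace itself independently.
    """
    for i, url in enumerate(urls):
        if i > 0 and delay_ms > 0:
            time.sleep(delay_ms / 1000)  # milliseconds -> seconds
        fetch(url)

seen = []
start = time.monotonic()
crawl_with_delay(["/a", "/b", "/c"], seen.append, delay_ms=50)
elapsed = time.monotonic() - start
print(seen)  # two 50 ms pauses, so the run takes at least 0.1 s in total
```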


Timeout

Time, in milliseconds, the crawler waits for a response from each URL or document. When the timeout period expires, the crawler does not index that URL or document and moves on to the next one.


Default:

  • 60000 for a web crawler with a sitemap or sitemap index trigger

  • 10000 for all other crawlers
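The skip-on-timeout behavior can be sketched as a per-URL time budget. The fetch stub and the 100 ms budget below are illustrative, not the product's implementation:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def fetch(url: str) -> str:
    """Stand-in for an HTTP request; one URL responds too slowly."""
    if url == "/slow":
        time.sleep(0.5)
    return f"content of {url}"

indexed, skipped = [], []
with ThreadPoolExecutor(max_workers=2) as pool:
    for url in ["/fast", "/slow", "/next"]:
        future = pool.submit(fetch, url)
        try:
            indexed.append(future.result(timeout=0.1))  # 100 ms budget per URL
        except TimeoutError:
            skipped.append(url)  # timed out: skip it, move to the next URL

print(indexed, skipped)  # /slow is skipped; /fast and /next are indexed
```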


Headers

User agent that the crawler uses when crawling your website.

For security, set a user agent for the crawler and configure your site to allow only that user agent. This prevents other bots from crawling your site.

Enter values for key and value. For example, enter the key as user-agent and the value as sitecorebot.

Default: Empty
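With the example key user-agent and value sitecorebot, the crawler's requests carry a header like the one below. This stdlib sketch only shows the header being attached; no request is sent, and the URL is illustrative:

```python
from urllib.request import Request

# Attach the key/value pair from the setting as an HTTP header.
req = Request("https://doc.example.com/", headers={"user-agent": "sitecorebot"})

# urllib normalizes the header name when storing it.
print(req.get_header("User-agent"))  # sitecorebot
```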


Accept cookies

Controls whether the crawler accepts navigation cookies set by websites. These cookies track a visitor's path (in this case, the crawler's) and record visited URLs.

One reason to disable these cookies is so that the crawler is not misled into re-indexing previously visited URLs. Another is that cookies can lock the crawler into the first locale it encounters, preventing it from indexing content across multiple locales.


If you index content in multiple locales, we recommend disabling navigation cookie acceptance in the crawler.

On the other hand, you might want to enable navigation cookies for websites that need them for subsequent access, especially websites that need authentication.

Default: Enabled. The crawler accepts navigation cookies.
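For an http.cookiejar-based client, "do not accept navigation cookies" amounts to a cookie policy that rejects everything. This is an illustrative stdlib sketch of the technique, not how the product implements the setting:

```python
from http import cookiejar

class BlockAllCookies(cookiejar.CookiePolicy):
    """Reject every cookie, so the crawler never carries state
    (such as a locale or a visited-URL trail) between requests."""
    netscape = True
    rfc2965 = False
    hide_cookie2 = False

    def set_ok(self, cookie, request):
        return False  # never store a cookie the server sets

    def return_ok(self, cookie, request):
        return False  # never send a cookie back

    def domain_return_ok(self, domain, request):
        return False

    def path_return_ok(self, path, request):
        return False

# An HTTP client given this jar ignores all Set-Cookie headers.
jar = cookiejar.CookieJar(policy=BlockAllCookies())
```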

Render JavaScript

Whether you want the crawler to render JavaScript on the page.

If you turn on Render JavaScript, the crawler waits for JavaScript to render and indexes that content in addition to the page source. If you do not turn on Render JavaScript, the crawler only indexes content in the page source.

If you turn on Render JavaScript, you must define some additional settings.

Default: Off
