Configure an advanced web crawler

The Sitecore Search advanced web crawler is a powerful crawler that crawls your content and adds it to an index. It can handle complicated use cases, such as authenticating to access your source content, creating index documents in multiple languages, using JavaScript to extract attribute values, and more.

Important

Refer to detailed specifications of crawler types when choosing an indexing method.

Follow the best practices for indexing content and crawling updates.

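This walkthrough describes how to:

Create an advanced web crawler source
Configure crawler settings
Configure a trigger
Create a document extractor
Schedule the crawler
Publish updates to the source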

Note

If your original content requires authentication before the crawler can access it, configure crawler authentication settings. For example, your original content might need a GUI-based user name and password, or an access token or key in the request header. You can do this at any point after source creation.

Create an advanced web crawler source

To create a source:

On the menu bar, click Sources.
Click Add Source.
In the SOURCE NAME field, enter a name for the source.
In the DESCRIPTION field, enter a few lines to describe the source you want to configure.
In the CONNECTOR drop-down list, click Web Crawler (Advanced).
Click Save. If there are no errors, Search creates a new source.

Configure crawler settings

Web crawler settings define the scope and behavior of a crawl, including the allowed domains, the maximum depth, and the maximum number of URLs.

To configure crawler settings:

On the menu bar, click Sources and then click the source you created.
In the left pane, click Web Crawler Settings and then, in the Web Crawler Settings section, click Edit.
Optionally, in the ALLOWED DOMAINS field, enter the domains for the crawler to stay within. This ensures that the crawler only crawls the domains you list, for example, Sitecore documentation domains, and does not follow links to external sites.

For example, enter www.doc.sitecore.com.
To configure the depth and number of URLs for the crawler to crawl:
- In the MAX DEPTH field, enter the maximum number of levels that you want the crawler to follow for a URL.
  
  For example, enter 5.
- In the MAX URLS field, enter the maximum number of URLs for the crawler to crawl, in total. Enter a large number to ensure that the crawler leaves no URL out.
  
  For example, enter 5000.
To exclude certain URL patterns from the crawler's scope, click Add Exclusion Pattern. Then, in the TYPE drop-down menu, select Glob expression or Regular Expression. In the VALUE field, enter the expression to match URLs to exclude.

For example, to prevent the crawler from crawling your search page, enter the following Glob expression:
**/search/**
To configure the number of workers to crawl in parallel and an optional delay between requests:
- Define the number of threads, or workers, that concurrently crawl and index content by clicking a value in the PARALLELISM (WORKERS) drop-down menu.
  
  For example, click 2 to have only two workers crawl in parallel. This uses less memory than the default of 5 workers.
- Optionally, if you have configured only one worker, you can define a time for the crawler to wait before it accesses the next URL to index. To do this, in the DELAY (MS) field, enter time in milliseconds.
  
  For example, enter 3.
In the TIMEOUT field, enter the time, in milliseconds, that the crawler waits to get a response.

For example, enter 5000. This ensures that the crawler waits for 5000 milliseconds, or 5 seconds, to get a response from every URL it crawls.
Optionally, to add headers, click Add Header. Then, in the Key field, enter the name of the header that your content expects, such as the user agent header. In the Value field, enter the value that your content expects for that header. If your site only accepts requests that carry this header, this security measure ensures that only the Search crawler, and not any other crawler, can crawl the data.

For example, enter user-agent as the Key and sitecorebot as the Value.
Optionally, if you want to stop the crawler from accepting navigation cookies, turn off ENABLE NAVIGATION COOKIES DURING CRAWLING in the Additional Settings section.

Navigation cookies track the crawler's path and record the URLs it visits. Sometimes, cookies might mislead the crawler into re-indexing previously visited URLs. However, cookies are important for websites that need them for subsequent access, especially websites that need authentication.
Optionally, if you want the crawler to wait for and crawl the content that JavaScript renders on the page, in addition to the page source, turn on Render JavaScript in the Additional Settings section.
Click Save.
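
The following Python sketch illustrates how the allowed domains, max depth, max URLs, and exclusion pattern settings might interact during a crawl. It is not how Search implements crawling. The in-memory link graph, the function names, the example.com URL, and the rule that `**` spans path separators while `*` does not are all assumptions made for illustration.

```python
import re
from collections import deque
from urllib.parse import urlparse


def glob_to_regex(pattern):
    """Translate a glob to a regex. Assumed dialect: '**' spans '/', '*' does not."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def crawl(start, links, allowed_domains, max_depth, max_urls, exclusions):
    """Breadth-first walk over a link graph, bounded by the four settings."""
    excluded = [glob_to_regex(p) for p in exclusions]
    visited = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue and len(visited) < max_urls:
        url, depth = queue.popleft()
        if urlparse(url).hostname not in allowed_domains:
            continue  # outside ALLOWED DOMAINS
        if any(rx.fullmatch(url) for rx in excluded):
            continue  # matches an exclusion pattern
        visited.append(url)
        if depth == max_depth:
            continue  # MAX DEPTH reached: do not follow links from this page
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


# Hypothetical site structure: each URL maps to the URLs it links to.
LINKS = {
    "https://www.doc.sitecore.com/": [
        "https://www.doc.sitecore.com/search/results",
        "https://www.doc.sitecore.com/guide",
        "https://example.com/external",
    ],
    "https://www.doc.sitecore.com/guide": [
        "https://www.doc.sitecore.com/guide/page-1",
    ],
    "https://www.doc.sitecore.com/guide/page-1": [
        "https://www.doc.sitecore.com/guide/page-2",
    ],
}

pages = crawl(
    "https://www.doc.sitecore.com/",
    LINKS,
    allowed_domains={"www.doc.sitecore.com"},
    max_depth=2,
    max_urls=5000,
    exclusions=["**/search/**"],
)
print(pages)
```

In this sketch, the search results page is dropped by the exclusion pattern, the external link is dropped by the allowed domains check, and page-2 is never reached because it sits three levels below the start URL.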

Configure a trigger

A trigger gives the advanced web crawler a starting point to look for content to index. In addition to using the request trigger, as described in this procedure, you can use the following types of triggers:

JavaScript
RSS
Sitemap
Sitemap Index

Note

Sitemap triggers leverage the last modified date (lastmod) field for each URL in your sitemap. This increases the indexing speed and improves the efficiency of your content updates.
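
To illustrate why lastmod helps, the following Python sketch selects only the sitemap URLs that changed after the previous crawl. It is an approximation of the idea, not the logic Search uses. The sample sitemap, the example.com URLs, the function name, and the decision to recrawl URLs that have no lastmod value are assumptions.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc><lastmod>2024-01-10</lastmod></url>
  <url><loc>https://example.com/b</loc><lastmod>2024-03-05</lastmod></url>
  <url><loc>https://example.com/c</loc></url>
</urlset>"""


def urls_modified_since(sitemap_xml, last_crawl):
    """Return the URLs whose lastmod date is later than last_crawl."""
    root = ET.fromstring(sitemap_xml)
    changed = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if lastmod is None:
            # Assumption: without lastmod, treat the URL as possibly changed.
            changed.append(loc)
            continue
        # lastmod starts with YYYY-MM-DD; ignore any time component.
        if date.fromisoformat(lastmod.strip()[:10]) > last_crawl:
            changed.append(loc)
    return changed


to_recrawl = urls_modified_since(SITEMAP_XML, date(2024, 2, 1))
print(to_recrawl)
```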

To configure a request trigger:

On the Source Settings page, click Edit next to Triggers.
Click Add Trigger.
In the Trigger Type drop-down field, select Request.
Optionally, in the Body field, enter the body of the request.
Optionally, to configure a header, click Add Header and enter values in the Key and Value fields.

For example, enter user-agent as the Key and sitecorebot as the Value.
Optionally, in the Method drop-down menu, click POST, PUT, or PATCH. By default, GET is selected.
In the URL field, paste the URL you want to use as the trigger.

For example, paste https://dev.sitecore.net/
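
As a rough mental model, the fields of a request trigger correspond to the parts of an ordinary HTTP request. The Python sketch below builds such a request with the standard library without sending it. The function name is hypothetical, and whether Search builds its request this way is an assumption.

```python
import urllib.request


def build_trigger_request(url, method="GET", headers=None, body=None):
    """Assemble an HTTP request from the URL, Method, Header, and Body fields."""
    data = body.encode("utf-8") if body is not None else None
    return urllib.request.Request(url, data=data, headers=headers or {}, method=method)


# Mirrors the example values used in this procedure.
request = build_trigger_request(
    "https://dev.sitecore.net/",
    headers={"user-agent": "sitecorebot"},
)
print(request.get_method(), request.full_url)
print(request.header_items())
```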

Create a document extractor

A document extractor specifies how to extract attribute values for one or more entities.

In addition to the XPath document extractor type described in this procedure, you can use:

CSS document extractor
JavaScript document extractor

To create an XPath document extractor:

On the menu bar, click Sources.
Select the advanced web crawler.
On the Source Settings page, next to Document Extractors, click Edit.
To create a document extractor, on the Document Extractors page:
- In the Name field, enter a meaningful name for this extractor.
  
  For example, enter Sitecore dev portal XPath extractor.
- In the Extractor Type drop-down menu, click XPath.
- Optionally, to ensure that this extractor's logic only applies to URLs that match a certain pattern, configure URLs to Match. To do this, in the URLs To Match field, click Add Matcher, select the TYPE of expression you want to use, and enter the VALUE of that expression.
To define how to extract attributes, in the Taggers section, click Edit next to the first tag. Usually, it is the content tag.

You see a list of some attributes and corresponding XPath extraction expressions. Search provides these sample attributes and expressions to help with configuration.
Note
In a document extractor, you can create multiple taggers, each linked to a unique tag. Each tagger:
- Generates a set of index documents.
- Can have multiple rules, where each rule defines the extraction logic of one attribute.

For example, one tagger with five rules yields one set of documents, each with five attributes. Three taggers with one rule each yield three sets of documents, each with one attribute.
To delete the sample attributes you don't need in this extractor, click Delete.
To edit the configuration of a sample attribute, click Edit, and then make changes.
To configure how to extract the value of a new attribute, click Add Rule and enter the following details:
- In the Attribute drop-down menu, select the attribute you want to configure.
  
  For example, click Abstract.
- In the Value type drop-down menu, choose whether you want the attribute value to be a fixed value or an expression.
  
  For example, click Expressions.
- In the EXPRESSION field, enter an XPath expression that results in the attribute value.
  
  For example, if you want to get the value of an attribute called abstract from the text of all p elements contained within elements whose class attribute contains the word abstract, enter the following XPath expression:
  //*[contains(@class, "abstract")]//p
Optionally, to configure more than one way to get an attribute value, click Add Selector. Then, in the Expression field of the second selector, enter the XPath expression that results in the attribute value.

For example, as a second option, you want to get the abstract from the text of all p elements that are within the div element whose class is topic-content. To do this, enter the following XPath expression:
//div[@class="topic-content"]//p
When there are multiple selectors, Search runs them in the order in which they are listed and stops at the first expression that gives a result.
In the tag editor, click Save. Then, on the Document Extractors page, click Save.
Optionally, to extract attributes for another tag, click Add Tagger, and in the Tag drop-down menu, click a tag. Then, repeat Steps 7 through 9.
On the Document Extractors page, click Save.
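
The following Python sketch illustrates the selector fallback described in this procedure: selectors run in order, and the first one that yields text wins. It is a simplified model, not the Search extractor. The sample page, the function names, and joining multiple p elements with a space are assumptions. Because the standard library supports only a subset of XPath, the contains() expression is emulated in plain Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical page with no "abstract" class, so the first selector finds nothing.
PAGE = """
<html><body>
  <div class="topic-content">
    <p>Fallback summary from the topic body.</p>
  </div>
</body></html>
"""


def abstract_selector(root):
    """Roughly emulates //*[contains(@class, "abstract")]//p."""
    return [
        p
        for el in root.iter()
        if "abstract" in el.get("class", "")
        for p in el.iter("p")
        if p is not el
    ]


def topic_content_selector(root):
    """Equivalent of //div[@class="topic-content"]//p."""
    return root.findall(".//div[@class='topic-content']//p")


def extract(root, selectors):
    """Run selectors in order and return the first non-empty text result."""
    for select in selectors:
        text = " ".join("".join(p.itertext()).strip() for p in select(root))
        if text:
            return text
    return None


abstract = extract(ET.fromstring(PAGE), [abstract_selector, topic_content_selector])
print(abstract)
```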

Schedule the crawler

To schedule the crawler:

In the left pane, click Crawler Scheduler and then, in the Crawler Scheduler section, click Edit.
In the STARTS drop-down list, click an option from the following:
- Anytime - if you want the schedule to start as soon as possible.
- Specific Date - if you want the crawl to start on a particular date. You also need to select a date in the date picker.
(Optional) To crawl only once, in the REPEAT drop-down list, click DOES NOT REPEAT. Then, to close the page, click Save.
To run the crawler on a schedule, in the REPEAT drop-down menu, click Yes.
In the drop-down lists for Repeats every, to define the frequency of crawls:
- In the repeat count drop-down list, click a value from 1 to 99.
- In the interval drop-down list, click one of the following values: hours, days, or weeks.
If you chose to start the schedule on a specific date in step 2, to define the time at which the crawl starts, in the RUN TIME drop-down list, click a start time.
In the END DATE drop-down menu, to configure the end of the crawler schedule, click an option from the following:
- Never - if you want the schedule to continue indefinitely.
- Specific Date - if you want the schedule to stop on a particular date. You also need to select a date in the date picker.
Click Save.
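
To see how the repeat count, interval, and end date combine, the following Python sketch lists the run times that a schedule would produce. The function name, the `limit` parameter, and treating the end date as inclusive are assumptions for illustration. Search might compute its schedule differently.

```python
from datetime import datetime, timedelta

# One unit of each interval offered in the interval drop-down list.
INTERVALS = {
    "hours": timedelta(hours=1),
    "days": timedelta(days=1),
    "weeks": timedelta(weeks=1),
}


def run_times(start, repeat_count, interval, end=None, limit=5):
    """List up to `limit` run times, spaced repeat_count intervals apart."""
    if not 1 <= repeat_count <= 99:
        raise ValueError("repeat_count must be between 1 and 99")
    step = repeat_count * INTERVALS[interval]
    times = []
    current = start
    while len(times) < limit and (end is None or current <= end):
        times.append(current)
        current += step
    return times


# Repeats every 2 days, starting on a specific date at 09:00, ending on a specific date.
schedule = run_times(
    datetime(2024, 1, 1, 9, 0),
    repeat_count=2,
    interval="days",
    end=datetime(2024, 1, 6),
)
for run in schedule:
    print(run.isoformat())
```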

Publish updates to the source

To start the first crawl and index items, you need to publish the source. You also need to publish the source every time you make any changes to the source.

To publish a source:

On the menu bar, click Sources.
Click the source you want to publish and click Publish.
In the Publish Source dialog, if you want Search to start a recrawl for this source, select the Trigger source recrawl after publishing check box.
Click Publish.
