Configure an advanced web crawler
The Sitecore Search advanced web crawler crawls your content and adds it to an index. It can handle complex use cases, such as source content that requires authentication, index documents in multiple languages, attribute values extracted with JavaScript, and more.
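This walkthrough describes how to:
- Create an advanced web crawler source
- Configure advanced web crawler settings
- Configure a request trigger
- Create an XPath document extractor
- Schedule scans
- Publish the source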
If your original content requires authentication before the crawler can access it, configure crawler authentication settings. For example, your original content might need a GUI-based user name and password, or an access token or key in the request header. You can do this at any point after source creation.
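To illustrate what "an access token or key in the request header" means at the HTTP level, here is a minimal Python sketch. It is not Sitecore Search configuration. The function name, the URL, the token value, and the choice of a `Bearer` scheme are all assumptions made for illustration; your content source may expect a different header or scheme.

```python
import urllib.request


def build_authenticated_request(url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a request that carries an access token.

    Hypothetical helper: the header name and "Bearer" scheme are assumptions.
    """
    request = urllib.request.Request(url)
    # A protected site typically rejects requests that lack this header.
    request.add_header("Authorization", f"Bearer {token}")
    return request


# Placeholder URL and token, used only to show the shape of the request.
example = build_authenticated_request("https://www.example.com/docs", "my-token")
print(example.get_header("Authorization"))
```

A crawler configured with header-based authentication sends an equivalent header with each request it makes to your site.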
Create an advanced web crawler source
Before you configure a source, you must create one.
In this example, select Web Crawler (Advanced) in step 5.
To create a source:
- Here, click Web Crawler (Advanced).
Configure advanced web crawler settings
Configure the crawler settings that define the scope of the advanced web crawler.
To configure the scope of the advanced web crawler:
- On the menu bar, click Sources and then click the source you created.
- On the Source Settings page, click Edit next to Advanced Web Crawler Settings.
- Click Save.
Configure a request trigger
Configure triggers to give the advanced web crawler a starting point to look for content to index.
In this example, you use a sitemap as the trigger.
Depending on your implementation, you can also use a sitemap index, a request, a JavaScript function, or an RSS feed as the trigger.
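To show why a sitemap works as a starting point, the following Python sketch reads a small sitemap and lists the page URLs it contains. It only illustrates the idea and is not how Sitecore Search is implemented. The sample sitemap, its URLs, and the function name are hypothetical; the namespace is the one used by the standard sitemap format.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, inlined so the example runs offline.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/products/widget</loc></url>
</urlset>"""


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


for url in urls_from_sitemap(SAMPLE_SITEMAP):
    print(url)
```

Each URL found this way is a candidate page for the crawler to visit and index.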
Create an XPath document extractor
Configure a document extractor to specify how to extract attribute values.
For an advanced web crawler, you can also configure a CSS document extractor or a JavaScript document extractor.
You can also extract values of attributes from more than one entity.
In this example, you use an XPath document extractor.
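As a rough illustration of how XPath expressions map page elements to attribute values, the Python sketch below extracts three values from a small, well-formed page. It is a sketch under assumptions, not the Sitecore Search extractor: the page markup, the attribute names (`name`, `title`, `description`), and the function name are hypothetical. Python's standard library supports only a subset of XPath, so the `content` value is read with `.get()` rather than with an `/@content` expression.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for illustration.
PAGE = """<html>
  <head>
    <title>Example product page</title>
    <meta name="description" content="A short summary of the page." />
  </head>
  <body>
    <h1>Example product</h1>
  </body>
</html>"""


def extract_attributes(page_source: str) -> dict:
    """Map XPath-style expressions to attribute values for one document."""
    root = ET.fromstring(page_source)
    meta = root.find(".//meta[@name='description']")
    return {
        "name": root.findtext(".//h1"),
        "title": root.findtext(".//title"),
        # The standard library cannot select attributes via XPath directly.
        "description": meta.get("content") if meta is not None else None,
    }


print(extract_attributes(PAGE))
```

Each key in the returned dictionary stands in for an index attribute, and each expression says where on the page its value comes from.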
Schedule scans
Publish the source
You must publish the source for Search to start the first scan and index.