Walkthrough: Configuring an advanced web crawler
The Sitecore Search advanced web crawler crawls your content and adds it to an index. It can handle complex use cases, such as source content that requires authentication, index documents in multiple languages, attribute values extracted with JavaScript, and more.
This walkthrough describes how to:
- Create an advanced web crawler source
- Configure advanced web crawler settings
- Configure a request trigger
- Create an XPath document extractor
- Schedule scans
- Publish the source
Create an advanced web crawler source
Before you configure a source, you must create one.
In this example, select Web Crawler (Advanced) in step 5.
Configure advanced web crawler settings
Configure crawler settings to define the high-level scope of the advanced web crawler.
To configure the scope of the advanced web crawler:
- Go to Sources and then click the source you created.
- On the Source Settings page, click Edit next to Advanced Web Crawler Settings.
- Click Save.
Configure a request trigger
Configure triggers to give the advanced web crawler a starting point to look for content to index.
In this example, you use a sitemap as the trigger.
Depending on your implementation, you can also use a sitemap index, a request, a JavaScript function, or an RSS feed as the trigger.
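To illustrate why a sitemap works as a starting point, the following minimal sketch shows how the URLs listed in a sitemap can be read into a list of pages to visit. It is not Sitecore Search code: the sitemap content, the `example.com` URLs, and the `urls_from_sitemap` function are all hypothetical, and the sample is parsed from an inline string rather than fetched over HTTP.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap content. A real crawler would download this file
# from the site; it is inlined here so the sketch is self-contained.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/products</loc></url>
  <url><loc>https://www.example.com/blog/first-post</loc></url>
</urlset>
"""

# Namespace defined by the sitemap protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Return the page URLs listed in a sitemap, in document order."""
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


if __name__ == "__main__":
    for url in urls_from_sitemap(SITEMAP_XML):
        print(url)
```

Each URL returned this way is a candidate page for the crawler to request and pass to a document extractor.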
Create an XPath document extractor
Configure a document extractor to specify how to extract attribute values.
For an advanced web crawler, you can also configure a CSS document extractor or a JavaScript document extractor.
In this example, you use an XPath document extractor.
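As a rough illustration of what XPath-based extraction does, the sketch below maps attribute names to XPath expressions and applies them to a page. It is an assumption-laden example, not the Sitecore Search extractor: the page markup, the attribute names (`name`, `description`, `heading`), and the `extract_attributes` function are hypothetical, and Python's standard library supports only a subset of XPath, so each rule pairs a path with an optional XML attribute to read.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page markup used only for this sketch.
PAGE = """<html>
  <head>
    <title>Blue Widget</title>
    <meta name="description" content="A sturdy blue widget." />
  </head>
  <body>
    <h1>Blue Widget</h1>
  </body>
</html>
"""

# Hypothetical extraction rules: index attribute -> (XPath, XML attribute).
# When the XML attribute is None, the element's text is used instead.
RULES = {
    "name": (".//head/title", None),
    "description": (".//meta[@name='description']", "content"),
    "heading": (".//body/h1", None),
}


def extract_attributes(page_xml, rules):
    """Apply each rule to the page and return the values that were found."""
    root = ET.fromstring(page_xml)
    document = {}
    for field, (path, xml_attr) in rules.items():
        node = root.find(path)
        if node is None:
            continue  # No match: leave this attribute out of the document.
        value = node.get(xml_attr) if xml_attr else (node.text or "").strip()
        if value:
            document[field] = value
    return document


if __name__ == "__main__":
    print(extract_attributes(PAGE, RULES))
```

The result is a flat mapping of attribute names to values, which is the general shape of an index document built from one page.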
Schedule scans
Publish the source
You must publish the source before Search can start the first scan and index your content.