Walkthrough: Configuring an advanced web crawler

The Sitecore Search advanced web crawler crawls your content and adds it to an index. It can handle complex use cases, such as source content that requires authentication, index documents in multiple languages, and attribute values that are extracted with JavaScript.

This walkthrough describes how to:

  • Create an advanced web crawler source

  • Configure advanced web crawler settings

  • Configure a request trigger

  • Create an XPath document extractor

  • Schedule scans

  • Publish the source

Create an advanced web crawler source

Before you configure a source, you must create one.

In this example, select Web Crawler (Advanced) in step 5.

Configure advanced web crawler settings

Configure crawler settings to define the high-level scope of the advanced web crawler.

To configure the scope of the advanced web crawler:

  1. Go to Sources and then click the source you created.

  2. On the Source Settings page, click Edit next to Advanced Web Crawler Settings.

  3. Click Save.

Configure a request trigger

Configure triggers to give the advanced web crawler a starting point to look for content to index.

In this example, you use a sitemap as the trigger.

Note

Depending on your implementation, you can also use a sitemap index, a request, a JavaScript function, or an RSS feed as the trigger.
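
To illustrate what a sitemap trigger gives the crawler, the following minimal Python sketch reads a sitemap and lists the URLs it contains. It is not how Sitecore Search processes sitemaps internally. The sample sitemap, the example.com URLs, and the function name urls_from_sitemap are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap content; the example.com URLs are placeholders.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/products/widget</loc></url>
  <url><loc>https://www.example.com/blog/launch</loc></url>
</urlset>
"""

# Namespace defined by the sitemaps.org protocol for <urlset>, <url>, and <loc>.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Return every <loc> value listed in a sitemap document, in order."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


if __name__ == "__main__":
    # Each URL is a page the crawler could visit and index.
    for url in urls_from_sitemap(SITEMAP_XML):
        print(url)
```

Each URL in the list is a candidate page for the crawler to visit, which is why a sitemap works well as a starting point.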

Create an XPath document extractor

Configure a document extractor to specify how to extract attribute values.

Note

For an advanced web crawler, you can also configure a CSS document extractor or a JavaScript document extractor.

In this example, you use an XPath document extractor.
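
As a rough illustration of the idea behind an XPath document extractor, the Python sketch below maps attribute names to values located with path expressions. It is not the Sitecore Search extractor itself. The sample page, the attribute names (title, description, name), and the function extract_attributes are hypothetical. Python's standard library supports only a subset of XPath, so attribute values are read with .get() after the element is located.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page source used only for this sketch.
PAGE = """<html>
  <head>
    <title>Widget Pro</title>
    <meta name="description" content="A compact widget for everyday use." />
  </head>
  <body>
    <h1 class="product-name">Widget Pro</h1>
  </body>
</html>
"""


def extract_attributes(page_source):
    """Build one index document (a dict) from a page using path expressions."""
    root = ET.fromstring(page_source)

    # Each expression locates the element that holds one attribute value.
    title = root.find(".//title")
    description = root.find(".//meta[@name='description']")
    heading = root.find(".//h1[@class='product-name']")

    return {
        "title": title.text if title is not None else None,
        "description": description.get("content") if description is not None else None,
        "name": heading.text if heading is not None else None,
    }


if __name__ == "__main__":
    print(extract_attributes(PAGE))
```

Returning None for a missing element keeps the sketch from failing on pages that lack one of the expected tags.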

Schedule scans

Publish the source

Search starts the first scan and index only after you publish the source.
