Configure an advanced web crawler

The Sitecore Search advanced web crawler crawls your content and adds it to an index. It handles complex use cases such as source content that requires authentication, index documents in multiple languages, attribute values extracted with JavaScript, and more.

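This walkthrough describes how to create an advanced web crawler source, configure its settings, configure a request trigger, create an XPath document extractor, schedule scans, and publish the source.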

Note

If your original content requires authentication before the crawler can access it, configure crawler authentication settings. For example, your original content might need a GUI-based user name and password, or an access token or key in the request header. You can do this at any point after source creation.
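
To illustrate what "an access token or key in the request header" means in practice, the following minimal Python sketch builds a request that carries a token. It is not Sitecore Search configuration. The URL, the token value, and the `Authorization: Bearer` scheme are assumptions for illustration; your content source might expect a different header name or key format.

```python
import urllib.request


def build_authenticated_request(url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request that carries an access token."""
    request = urllib.request.Request(url)
    # Assumed scheme: "Authorization: Bearer <token>". A real source may
    # require a different header, such as a custom API key header.
    request.add_header("Authorization", f"Bearer {token}")
    return request


# Hypothetical protected page and token, for illustration only.
request = build_authenticated_request(
    "https://www.example.com/protected/page", "example-token"
)
print(request.get_header("Authorization"))
```

A crawler that is configured with header-based authentication sends an equivalent header with each request so that the protected content is returned instead of a login page or an error.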

Create an advanced web crawler source

Before you configure a source, you must create one.

Note

In this example, select Web Crawler (Advanced) in step 5.

To create a source:

  1. Here, click Web Crawler (Advanced).

Configure advanced web crawler settings

Configure crawler settings to define the high-level scope of the advanced web crawler.

To configure the scope of the advanced web crawler:

  1. On the menu bar, click Sources and then click the source you created.

  2. On the Source Settings page, click Edit next to Advanced Web Crawler Settings.

  3. Click Save.

Configure a request trigger

Configure triggers to give the advanced web crawler a starting point to look for content to index.

In this example, you use a sitemap as the trigger.

Note

Depending on your implementation, you can also use a sitemap index, a request, a JavaScript function, or an RSS feed as the trigger.
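
To show why a sitemap works as a starting point, here is a minimal Python sketch of how the page URLs in a sitemap can be listed. It only illustrates the idea and is not how Sitecore Search is implemented. The sample sitemap and the `urls_from_sitemap` helper are hypothetical; the XML namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace used by the standard sitemap protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, for illustration only.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/products</loc></url>
  <url><loc>https://www.example.com/blog/first-post</loc></url>
</urlset>"""


def urls_from_sitemap(sitemap_xml: str) -> List[str]:
    """Return every <loc> value listed under <url> in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


for url in urls_from_sitemap(SAMPLE_SITEMAP):
    print(url)
```

Each URL found this way is a candidate page for the crawler to request and index, which is why the sitemap serves as the trigger.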

Create an XPath document extractor

Configure a document extractor to specify how to extract attribute values.

Note

For an advanced web crawler, you can also configure a CSS document extractor or a JavaScript document extractor.

You can also extract values of attributes from more than one entity.

In this example, you use an XPath document extractor.
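
To make the idea of mapping attributes to XPath expressions concrete, the following Python sketch extracts two values from a small page. It is an illustration under stated assumptions, not the extractor that Sitecore Search runs. The attribute names `name` and `description`, the sample page, and the expressions are hypothetical. Python's `xml.etree.ElementTree` supports only a subset of XPath and needs well-formed markup, so the sample page is written as valid XML.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Hypothetical, well-formed sample page, for illustration only.
SAMPLE_PAGE = """<html>
  <head>
    <title>Getting started</title>
    <meta name="description" content="How to set up the product." />
  </head>
  <body>
    <h1>Getting started</h1>
  </body>
</html>"""


def extract_attributes(page: str) -> Dict[str, Optional[str]]:
    """Map assumed index attribute names to values located by XPath."""
    root = ET.fromstring(page)
    # Text content of the first <title> element anywhere in the page.
    name = root.findtext(".//title")
    # The content attribute of the meta tag whose name is "description".
    meta = root.find(".//meta[@name='description']")
    description = meta.get("content") if meta is not None else None
    return {"name": name, "description": description}


print(extract_attributes(SAMPLE_PAGE))
```

In the same way, each XPath expression you enter in a document extractor selects the part of the page that supplies the value for one attribute of the index document.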

Schedule scans

Publish the source

You must publish the source for Search to start the first scan and index.
