Configure a web crawler

The Sitecore Search web crawler is a source that crawls your content by following hyperlinks and creates index documents from what it finds. The web crawler is straightforward, easy to configure, and requires no coding. It indexes HTML pages and PDFs that it reaches through hypertext links.

The web crawler begins from a starting point called a trigger. If your starting point is a sitemap or sitemap index, the crawler crawls all URLs in the sitemap or sitemap index. If your starting point is a single URL, called a request, the crawler follows hyperlinks from that page to new pages.

Important

Refer to detailed specifications of crawler types when choosing an indexing method.

Follow the recommended best practices when indexing content or crawling updates.

This walkthrough describes how to create a source for a web crawler, configure crawler settings, extract and validate values for attributes, schedule the crawler, and publish updates to the source.

In this example, you extract values for the author, name, and type attributes.

Create a source for a web crawler

To create a source for a web crawler:

  1. In the CONNECTOR drop-down list, click Web Crawler.

Configure crawler settings

Configuring web crawler settings includes defining a crawl trigger and max URLs among other settings.

To configure crawler settings:

  1. On the menu bar, click Sources and then click the source you created.

  2. In the left pane, click Web Crawler Settings, then, in the Web Crawler Settings section, click Edit.

  3. To configure a trigger for the web crawler, in the TRIGGER TYPE drop-down list, choose one of the following options: Request, Sitemap, or Sitemap Index.

  4. In the URL or URLS field, enter the URL of the request, sitemap, or sitemap index.

    For example, enter https://doc.sitecore.com/search/sitemap.xml.

    Note

    If you select a sitemap or sitemap index, you can add more than one URL.

  5. Click Save.
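For reference, a minimal sitemap is an XML file that lists the page URLs the crawler visits when the trigger type is Sitemap. The URLs below are illustrative, not taken from a real sitemap:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Each <url> entry is a page the crawler will fetch and index -->
  <url>
    <loc>https://www.bank.com/commercial/loans</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://www.bank.com/personal/savings</loc>
    <lastmod>2024-02-03</lastmod>
  </url>
</urlset>
```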

Extract values for attributes

The web crawler offers different methods to extract values for your attributes. The method you choose depends on where the value is located and how it can be accessed.

Note

To extract values for multiple entities in a single crawl, you need to convert the web crawler to an advanced web crawler. A web crawler can extract values for only one entity.

Depending on the method, you can set an attribute value that is:

  • Extracted from the meta tags in the header section of a webpage.

  • Extracted by traversing an XML document.

  • Identical in all items indexed by the crawler.

To simplify configuration, the web crawler is preconfigured to use the XPath method to extract values for type, url, name, and description. You can always change the method by choosing a different type in the EXTRACTION TYPE drop-down list.
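To illustrate how XPath extraction works, the following Python sketch applies XPath expressions to a page's title and meta tags. The HTML snippet and attribute values are invented for this example, and the standard library's limited XPath subset stands in for the crawler's own XPath engine:

```python
import xml.etree.ElementTree as ET

# A hypothetical, well-formed page header; real pages may need an HTML parser.
page = ET.fromstring(
    "<html><head>"
    "<title>Commercial Loans</title>"
    "<meta name='author' content='Jane Doe'/>"
    "<meta name='type' content='article'/>"
    "</head><body/></html>"
)

# XPath expressions similar in spirit to a crawler's extraction rules:
# name comes from the <title> text, author and type from meta tag attributes.
name = page.find(".//title").text
author = page.find(".//meta[@name='author']").get("content")
doc_type = page.find(".//meta[@name='type']").get("content")
```

Each expression locates one node in the document; the extracted text or attribute becomes the value of the corresponding index attribute.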

To edit or configure extraction of values for all attributes in the entity, use the procedures described in the following tabs.

Validate extraction logic

To check that the attribute extraction logic you defined results in the attribute values you expect, you can validate your configuration.

To validate attribute extraction logic:

  1. In the left pane, click Attribute Extraction, then, in the Attribute Extraction section, click Edit.

  2. On the menu bar at the top of the page, click Validate.

  3. In the validation window, in the VALIDATION URL field, enter a URL from the content you want indexed.

    For example, if you're configuring a source to index all content from www.bank.com, you can enter www.bank.com/commercial/loans as a sample URL to test validation.

  4. Optionally, click Add Validation URL and enter further URLs.

    For example, continuing the example of www.bank.com, you can enter sample URLs from the commercial, personal, and mortgage sections of the website. Or, if you know that one section of your website has many images and another has videos, you can enter one URL from each section.

    Note

    We recommend entering multiple URLs to ensure that your extraction logic works across all the content to be indexed. Enter URLs that have different page structures and content types.

  5. Click Validate.

    Beneath each URL, you see a list of attributes and corresponding values.

    Note

    If you see an error next to an attribute, you'll know that the current extraction logic does not work for that URL, and you might need to edit it.

  6. To close the validation window, click Close.

  7. Optionally, update the attribute extraction logic and repeat steps 2 through 5 to check if the new logic returns all attribute values for all sample URLs.

  8. Click Save.
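Conceptually, validation applies your extraction rules to each sample URL and reports any attribute that produces no value. A minimal Python sketch of that idea, using invented page contents and the standard library's XPath subset in place of the real validation service:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample pages standing in for validation URLs;
# the real validation fetches the live pages.
pages = {
    "www.bank.com/commercial/loans": (
        "<html><head><title>Commercial Loans</title>"
        "<meta name='author' content='Jane Doe'/></head><body/></html>"
    ),
    # This page is missing the author meta tag, so extraction fails for it.
    "www.bank.com/personal/savings": (
        "<html><head><title>Personal Savings</title></head><body/></html>"
    ),
}

# Extraction rules: attribute -> (XPath to the node, node attribute or None for text)
rules = {
    "name": (".//title", None),
    "author": (".//meta[@name='author']", "content"),
}

results = {}
for url, source in pages.items():
    root = ET.fromstring(source)
    extracted = {}
    for attr, (xpath, node_attr) in rules.items():
        node = root.find(xpath)
        if node is None:
            # Analogous to the error shown next to an attribute in the UI.
            extracted[attr] = None
        else:
            extracted[attr] = node.text if node_attr is None else node.get(node_attr)
    results[url] = extracted
```

Running the rules against pages with different structures surfaces exactly the kind of gap the documentation recommends testing for: the second URL yields no author value, signalling that the extraction logic needs adjusting.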

Schedule the crawler

To schedule the crawler:

Publish updates to the source

To start the first crawl and index items, you must publish the source. You must also republish the source every time you change its configuration.
