Configure a web crawler

The Sitecore Search web crawler is a source that crawls your content by following hyperlinks and creates index documents from what it finds. The web crawler is straightforward, easy to configure, and requires no coding. It indexes HTML pages and PDFs that it reaches through hypertext links.

The web crawler begins from a starting point called a trigger. If your starting point is a sitemap or sitemap index, the crawler crawls all URLs in the sitemap or sitemap index. If your starting point is a single URL, called a request, the crawler follows hyperlinks from that page to new pages.

Important

Refer to detailed specifications of crawler types when choosing an indexing method.

Follow the recommended best practices when indexing content or crawling updates.

This walkthrough describes how to create a source for a web crawler, configure crawler settings, extract and validate values for attributes, schedule the crawler, and publish updates to the source.

In this example, you extract values for the author, name, and type attributes.

Create a source for a web crawler

To create a source for a web crawler:

  1. In the CONNECTOR drop-down list, click Web Crawler.

Configure crawler settings

Configuring web crawler settings includes defining a crawl trigger and max URLs among other settings.

To configure crawler settings:

  1. On the menu bar, click Sources and then click the source you created.

  2. In the left pane, click Web Crawler Settings, then, in the Web Crawler Settings section, click Edit.

  3. To configure a trigger for the web crawler, in the TRIGGER TYPE drop-down list, choose one of the following options: Request, Sitemap, or Sitemap Index.

  4. In the URL or URLS field, enter the URL of the request, sitemap, or sitemap index.

    For example, enter https://doc.sitecore.com/search/sitemap.xml.

    Note

    If you select a sitemap or sitemap index, you can add more than one URL.

  5. Click Save.
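For reference, a minimal sitemap is an XML file that lists the page URLs the crawler visits when the trigger type is Sitemap. The URLs below are illustrative, not taken from a real sitemap:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Each <url> entry is a page the crawler will fetch and index -->
  <url>
    <loc>https://www.bank.com/commercial/loans</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://www.bank.com/personal/savings</loc>
    <lastmod>2024-02-03</lastmod>
  </url>
</urlset>
```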

Extract values for attributes

The web crawler offers different methods to extract values for your attributes. The method you choose depends on where the value is located and how it can be accessed.

Note

To extract values for multiple entities in a single crawl, you need to convert the web crawler to an advanced web crawler. A web crawler can extract values for only one entity.

Depending on the method, you can set an attribute value that is:

  • Extracted from the meta tags in the header section of a webpage.

  • Extracted by traversing an XML document.

  • Identical in all items indexed by the crawler.

To simplify configuration, the web crawler is preconfigured to use the XPath method to extract values for type, url, name, and description. You can always change the method by choosing a different type in the EXTRACTION TYPE drop-down list.
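To illustrate how XPath extraction works, the following Python sketch applies XPath expressions to a page's title and meta tags. The HTML snippet and attribute values are invented for this example, and the standard library's limited XPath subset stands in for the crawler's own XPath engine:

```python
import xml.etree.ElementTree as ET

# A hypothetical, well-formed page header; real pages may need an HTML parser.
page = ET.fromstring(
    "<html><head>"
    "<title>Commercial Loans</title>"
    "<meta name='author' content='Jane Doe'/>"
    "<meta name='type' content='article'/>"
    "</head><body/></html>"
)

# XPath expressions similar in spirit to a crawler's extraction rules:
# name comes from the <title> text, author and type from meta tag attributes.
name = page.find(".//title").text
author = page.find(".//meta[@name='author']").get("content")
doc_type = page.find(".//meta[@name='type']").get("content")
```

Each expression locates one node in the document; the extracted text or attribute becomes the value of the corresponding index attribute.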

To edit or configure extraction of values for all attributes in the entity, use the procedures described in the following tabs.

Validate extraction logic

To check that the attribute extraction logic you defined results in the attribute values you expect, you can validate your configuration.

To validate attribute extraction logic:

  1. In the left pane, click Attribute Extraction, then, in the Attribute Extraction section, click Edit.

  2. On the menu bar at the top of the page, click Validate.

  3. In the validation window, in the VALIDATION URL field, enter a URL from the content you want indexed.

    For example, if you're configuring a source to index all content from www.bank.com, you can enter www.bank.com/commercial/loans as a sample URL to test validation.

  4. Optionally, click Add Validation URL and enter further URLs.

    For example, continuing the example of www.bank.com, you can enter sample URLs from the commercial, personal, and mortgage sections of the website. Or, if you know that one section of your website has many images and another has videos, you can enter one URL from each section.

    Note

    We recommend entering multiple URLs to ensure that your extraction logic works across all the content to be indexed. Enter URLs that have different page structures and content types.

  5. Click Validate.

    Beneath each URL, you see a list of attributes and corresponding values.

    Note

    If you see an error next to an attribute, you'll know that the current extraction logic does not work for that URL, and you might need to edit it.

  6. To close the validation window, click Close.

  7. Optionally, update the attribute extraction logic and repeat steps 2 through 5 to check if the new logic returns all attribute values for all sample URLs.

  8. Click Save.
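Conceptually, validation applies your extraction rules to each sample URL and reports any attribute that produces no value. A minimal Python sketch of that idea, using invented page contents and the standard library's XPath subset in place of the real validation service:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample pages standing in for validation URLs;
# the real validation fetches the live pages.
pages = {
    "www.bank.com/commercial/loans": (
        "<html><head><title>Commercial Loans</title>"
        "<meta name='author' content='Jane Doe'/></head><body/></html>"
    ),
    # This page is missing the author meta tag, so extraction fails for it.
    "www.bank.com/personal/savings": (
        "<html><head><title>Personal Savings</title></head><body/></html>"
    ),
}

# Extraction rules: attribute -> (XPath to the node, node attribute or None for text)
rules = {
    "name": (".//title", None),
    "author": (".//meta[@name='author']", "content"),
}

results = {}
for url, source in pages.items():
    root = ET.fromstring(source)
    extracted = {}
    for attr, (xpath, node_attr) in rules.items():
        node = root.find(xpath)
        if node is None:
            # Analogous to the error shown next to an attribute in the UI.
            extracted[attr] = None
        else:
            extracted[attr] = node.text if node_attr is None else node.get(node_attr)
    results[url] = extracted
```

Running the rules against pages with different structures surfaces exactly the kind of gap the documentation recommends testing for: the second URL yields no author value, signalling that the extraction logic needs adjusting.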

Schedule the crawler

To schedule the crawler:

Publish updates to the source

To start the first crawl and index items, you must publish the source. You must also republish the source every time you change its configuration.
