Configure indexing

After you choose the indexing methods you want to use in Sitecore Search, you need to configure a source for each one.

This topic introduces and links to walkthroughs that describe how to configure the following source types: web crawlers, API crawlers, API push, PDFs, localized crawlers, and localized API push.

Tip

Watch this video overview of how to configure a source in Sitecore Search.

While specifics and complexity differ, all sources have these broad configuration elements:

  • What content to index - called a trigger. Use this setting to tell Search exactly what content you want indexed. For example, you can tell Search to index all links on a sitemap or to start from a specified URL and then follow hyperlinks.

  • How to index content - called a document extractor. Use this setting to tell Search how to extract pieces of information from your original content and assign them as attribute values. For example, to populate the abstract attribute, you can tell Search to use the XPath expression //*[contains(@class, "abstract")]//p, which takes the text of the first <p> that is a descendant of an element whose class attribute contains abstract.

  • How to update indexed content - you can configure a crawler schedule to ensure that index documents reflect the latest version of your original content. Additionally, if you want to make incremental changes to index documents, you can enable incremental updates. This allows developers to use the Ingestion API to update index documents.

  • Other settings - these are optional configurations that depend on your business requirements.
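To make the document extractor example above concrete, here is a minimal Python sketch of what the XPath expression //*[contains(@class, "abstract")]//p selects. It only illustrates the matching logic and is not Sitecore Search configuration syntax. The sample markup and the function name are invented for illustration, and the markup is assumed to be well-formed XML, which real pages often are not. The standard library's XPath subset has no contains() function, so the sketch walks the tree instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment, invented for this illustration.
SAMPLE = """<html>
  <body>
    <div class="post abstract">
      <h2>Summary</h2>
      <div><p>First paragraph of the abstract.</p></div>
      <p>Second paragraph.</p>
    </div>
    <p>Unrelated paragraph.</p>
  </body>
</html>"""


def first_abstract_paragraph(markup):
    """Return the text of the first <p> below an element whose class contains "abstract"."""
    root = ET.fromstring(markup)
    for element in root.iter():
        # Same test as contains(@class, "abstract"): a substring match on the attribute.
        if "abstract" not in element.get("class", ""):
            continue
        # iter("p") walks the subtree in document order, starting with the element itself.
        for paragraph in element.iter("p"):
            if paragraph is element:
                continue  # //p needs a descendant, not the matching element itself
            return "".join(paragraph.itertext()).strip()
    return None


print(first_abstract_paragraph(SAMPLE))
```

Because the class test is a substring match, an element with class="post abstract" qualifies, but so would one with class="abstracts". Keep that in mind when you choose the expression for an attribute.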

Note

For your sources, you'll need to configure:

  • A trigger to start an indexing job.

  • A schedule for crawlers to ensure the index is up to date.

  • Document extractors to parse metadata from your content.

The following are optional settings that depend on your business requirements:

  • General crawler settings - a suite of settings that includes, for example, domains that can be crawled, URLs to avoid, and the user agent to use. Sitecore Search provides default values that typically suffice for most initial setups, although you can change these settings to meet your requirements.

    For example, you can change the maximum URLs to be crawled from 1000 to 3000.

  • Locale-related settings - if you need to index content across more than one locale, you must define the available locales for this source. You'll also need to define a locale extractor, a configuration that tells Search how to get the locale from each URL. This is important because each index document needs to have a locale attached to it.

    For example, you can add ja-JP and fr-FR as available locales and then configure how to extract these locales from your content.

  • Authentication-related settings - if your content requires authentication before it can be accessed, you'll need to configure crawler authentication settings.

    For example, if your content needs a username and password before it can be accessed, configure browser authentication.

  • Tags - you can create custom tags. A tag is an entity-based configuration that specifies exactly which attributes you want a set of index documents to have. By default, Sitecore Search creates one tag per entity, and you can create more.

    For example, to manage a website with both internal and external blog content, you can create separate tags for each type under the Blog entity.

  • Additional content to index - if you find that the trigger does not cover all the items you want to index, you can define a request extractor, a configuration that tells Search how to generate additional URLs to crawl.

    For example, if you use an API crawler where the trigger returns JSON, you can define a request extractor that uses returned JSON to generate API endpoints (URLs).
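Two of the optional settings above, the locale extractor and the request extractor, are easier to picture as small functions. The following Python sketch shows only the logic that such configurations express. It is not Sitecore Search syntax, and it rests on assumptions that are not from this topic: the locale is assumed to appear as a path segment of the URL, the trigger is assumed to return JSON shaped like {"items": [{"id": ...}]}, and the example.com URLs and function names are hypothetical.

```python
import json
from urllib.parse import urlparse

# Locales taken from the example in the list above; a real source defines its own.
AVAILABLE_LOCALES = ("ja-JP", "fr-FR")


def extract_locale(url, available=AVAILABLE_LOCALES):
    """Return the available locale that appears as a URL path segment, or None."""
    by_lowercase = {locale.lower(): locale for locale in available}
    for segment in urlparse(url).path.split("/"):
        match = by_lowercase.get(segment.lower())
        if match:
            return match
    return None


def build_requests(trigger_body, base_url):
    """Generate one additional API URL per item listed in the trigger's JSON response."""
    payload = json.loads(trigger_body)
    # Assumed response shape: {"items": [{"id": ...}, ...]}
    return [f"{base_url}/{item['id']}" for item in payload["items"]]


print(extract_locale("https://www.example.com/ja-jp/products/1"))
print(build_requests('{"items": [{"id": 7}, {"id": 9}]}', "https://api.example.com/articles"))
```

In this sketch, a URL with no recognizable locale segment yields None. Because each index document needs a locale, a real locale extractor has to cover every URL pattern the source crawls.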

Important

Refer to detailed specifications of crawler types when choosing an indexing method.

Follow these indexing best practices to successfully index your searchable content.
