Indexing content

In Sitecore Search, a source is a configuration that defines the content you want to index, making it searchable by visitors.

Indexing is important because Search does not connect to your original content when a visitor makes a search query. Instead, Search looks in indexes and applies its algorithm to identify content that matches the query.

You can configure a source to access content in different locations and in different formats, such as HTML pages, PDFs, Microsoft Office documents, and JSON objects. Your search implementation can have multiple sources.

We've provided some guidelines to help you choose the best type of source configuration for your content.

Important

When you configure sources, remember that you are determining which content Search examines to return results. Consider your entities and attributes and the intended search experiences. Doing this ensures that you can configure the correct number of sources and define the appropriate scope of each source.

Sources and indexes

You won't directly interact with the indexes or index documents created by Sitecore Search, but understanding what they are and how they work can be useful when configuring a source.

The following diagram illustrates how your original content, indexes, and index documents relate to each other:

 

An index is a data structure representing content items that are associated with a specific source. For example, a content item could be a URL, a PDF document, a Microsoft Word document, or a JSON object.

An index document represents a single content item. For each locale, there is a 1:1 mapping between index documents and content items. For example, a 1,000-word HTML page is mapped to one index document, and a 10-page PDF is mapped to one index document.
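
As a rough illustration of this mapping, the sketch below keys a toy in-memory index by locale and content item ID, so each content item in each locale yields exactly one index document. The `ContentItem` class, the `build_index` function, and the locale codes are hypothetical names chosen for this example; they are not part of Sitecore Search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentItem:
    """A hypothetical content item: one page, PDF, or JSON object in one locale."""
    item_id: str
    locale: str
    title: str


def build_index(items):
    """Map each (locale, item_id) pair to exactly one index document."""
    index = {}
    for item in items:
        key = (item.locale, item.item_id)
        if key in index:
            raise ValueError(f"duplicate content item for {key}")
        # One content item yields one index document, whatever its size.
        index[key] = {"id": item.item_id, "title": item.title}
    return index


items = [
    ContentItem("about-us", "en_us", "About us"),
    ContentItem("about-us", "fr_fr", "À propos"),
    ContentItem("whitepaper-pdf", "en_us", "Product white paper"),
]
index = build_index(items)
```

Here the same page in two locales produces two index documents, while the multi-page PDF still produces only one.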

Every index document is a JSON object, with each property representing the name of a content item's attribute and the associated value.

For example, the index document for your company's About us page might have the following structure:

{
    "_index": "sampleindexnumber12345",
    "_id": "internalID123",
    "_version": 1,
    "_seq_no": 10,
    "_primary_term": 1,
    "found": true,
    "rfk_source": [
        "content": {
            "id": "id83723bBnjh9321",
            "url": "https://www.sitecore.com/company",
            "image_url": "https://wwwsitecorecom.db.net/images/3d-scene05-violet.png?",
            "content_type": "others",
            "title": "About us",
            "description": "We’re growing at a historic rate. 2,200 employees...",
            "subtitle": "Welcome to Sitecore. We help brands create human connections...",
            "last_modified": "2023-09-13 17:54:34"
        },
        "_entity": "content"
    ],
    "questions_and_answers": {...},
    "rfk_stats": {...},
...
}
//sample index document for illustration only

The rfk_source array is the main part of an index document, containing attributes and their corresponding values.

In addition to the attributes that you configure a source to extract, index documents contain other important information that Search uses to generate results. This includes the internal attribute rfk_stats, related questions and answers, and more. You can view some of these details in the RFK Details section of content items in the Content Collection. Sitecore Search populates these items based on your attribute and feature configuration.

Note

You can view all your searchable content on the Content Collection page of the Customer Engagement Console (CEC). After you index content, it takes a few minutes to appear in the content collection.

Types of sources

Sitecore Search offers two methods for indexing content: pull sources and push sources.

  • With a pull source, Sitecore Search provides a crawler to scan and index your content, but you define rules that determine which content gets indexed and how to extract attributes. You can think of indexing content using a pull source as a collaborative effort between you and Search.

    Note

    When you use a pull source, you can also make incremental updates by having a developer use the Ingestion API to add or modify index documents.

  • With a push source, you create an empty index and then have a developer use an API to create index documents. This approach gives you full control over indexing: you're responsible for creating each index document and populating it with the correct attributes and attribute values.

Note

For most implementations, you can use pull sources to index the bulk of your content and then use a push source for special cases.

Types of pull sources

Search offers the following pull sources:

  • Web crawler, a basic crawler that crawls your content by starting from a point and following hyperlinks or by going through a sitemap. The web crawler is straightforward, easy to configure, and requires no coding.

  • Advanced web crawler, a powerful and highly customizable crawler that supports complex use cases like handling authentication requirements, handling localized content, using JavaScript to extract attribute values, and more.

  • API crawler, a crawler specifically designed to crawl API endpoints that return JSON. It supports complex use cases like handling authentication requirements, handling localized content, using JSONPath or JavaScript to extract attribute values, and more.
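
To make the idea of crawling from a starting point and following hyperlinks more concrete, here is a minimal conceptual sketch. It is not how the Sitecore Search crawlers are implemented or configured. The in-memory `SITE` dictionary stands in for real HTTP responses, and `PageParser`, `crawl`, and `excluded_prefixes` are hypothetical names used only for this example.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory site standing in for real HTTP responses.
SITE = {
    "/": '<title>Home</title><a href="/company">About</a><a href="/products">Products</a>',
    "/company": '<title>About us</title><a href="/">Home</a>',
    "/products": '<title>Products</title><a href="/company">About</a><a href="/private/admin">Admin</a>',
    "/private/admin": "<title>Admin</title>",
}


class PageParser(HTMLParser):
    """Collects the page title and the href of every anchor tag."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl(start, excluded_prefixes=()):
    """Breadth-first crawl from `start`, returning one document per page."""
    seen, queue, documents = {start}, deque([start]), []
    while queue:
        url = queue.popleft()
        parser = PageParser()
        parser.feed(SITE[url])
        # The extracted attributes here are just the URL and the title.
        documents.append({"url": url, "title": parser.title})
        for link in parser.links:
            allowed = not any(link.startswith(p) for p in excluded_prefixes)
            if link in SITE and link not in seen and allowed:
                seen.add(link)
                queue.append(link)
    return documents


docs = crawl("/", excluded_prefixes=("/private",))
```

The exclusion rule plays the role of the rules you define for a pull source: the crawler does the scanning, and your rules decide which pages are indexed and which attributes are extracted.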

Types of push sources

Search offers a single type of push source: the API push source, which creates an index to receive index documents. A developer can use the Ingestion API to push index documents to an index.
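
Because a push source makes you responsible for every index document, it can help to validate documents before sending them. The following sketch shows one possible way to do that. It deliberately stops short of calling the Ingestion API and does not reproduce that API's request format; `REQUIRED_ATTRIBUTES` and `build_document` are hypothetical names, and the required attributes are an assumed configuration.

```python
import json

# Hypothetical attribute configuration for a "content" entity.
REQUIRED_ATTRIBUTES = ("id", "url", "title")


def build_document(item):
    """Return a JSON-serializable index document, or raise if incomplete."""
    missing = [name for name in REQUIRED_ATTRIBUTES if not item.get(name)]
    if missing:
        raise ValueError(f"missing required attributes: {', '.join(missing)}")
    # Keep only non-empty values so the document carries no blank attributes.
    return {name: value for name, value in item.items() if value not in (None, "")}


item = {
    "id": "about-us",
    "url": "https://www.sitecore.com/company",
    "title": "About us",
    "description": "",
}
# Serialized body that a developer could then send with the Ingestion API.
payload = json.dumps(build_document(item), ensure_ascii=False)
```

Checking documents up front like this means an incomplete content item fails in your own code rather than producing an index document with missing attributes.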
