Sources
Introduction to content sources for Sitecore Search
In Sitecore Search, you index content by creating a source.
A source is a configuration that defines which content you want to make searchable in a Sitecore Search experience. You can configure a source to access content in different locations, such as on a website or in a database, and different formats, such as HTML pages, PDFs, Microsoft Office documents, and JSON objects.
An important part of source configuration is defining how to extract values for the attributes you previously defined. If you do not extract attribute values, you cannot use those attributes to create search experiences. For example, even if you have created attributes for title, description, and review_rating, you cannot show a review rating facet or display a title and description for items on the search results page unless you extract values for these attributes from your original content.
When you configure and publish a source, Search creates an index of content items. Then, at runtime, when a visitor performs a search, the Search engine looks up the user's query in the index to identify content that matches the query. You can configure pull sources, or crawlers, to find and index content. Additionally, you can use the Ingestion API to push content into an index.
You can create multiple sources within a single Search domain. For example, your implementation might require three advanced web crawler sources and two API crawler sources. Each source has a source ID that you pass at run time to configure different search experiences.
For example, if you want to search and show recommendations across all sources on your company's main search page, pass all source IDs at run time. If the search page of your company's developer section needs only a subset of sources, pass only those source IDs at run time.
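For instance, a client might pass the relevant source IDs in the body of a search request, along the lines of the sketch below. The endpoint URL and the sources field are assumptions made for illustration; the Search and Recommendation API reference defines the actual request shape.

```typescript
// Sketch: run a search restricted to a subset of sources by passing their source IDs.
// The endpoint and the 'sources' field are illustrative assumptions, not the
// documented request shape.
const developerSectionSourceIds = ['SOURCE_ID_DOCS', 'SOURCE_ID_COMMUNITY']; // hypothetical IDs

async function searchDeveloperSection(keyphrase: string): Promise<unknown> {
  const response = await fetch('https://YOUR_SEARCH_ENDPOINT/search', { // hypothetical endpoint
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      query: keyphrase,
      sources: developerSectionSourceIds, // omit or pass all IDs to search every source
    }),
  });
  return response.json();
}
```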
We have provided some guidelines to help you choose the best type of source configuration for your content.
Sources and indexes
Search creates indexes and index documents behind the scenes, so you do not see an actual index or index document. However, understanding these terms can be helpful when you configure a source because it helps you visualize how your configuration enables Search.
Here's a representation of how your original content, indexes, and index documents relate to each other:

An index is a data structure that stores a representation of all content items that fall under the scope of a source. Sources and indexes have a 1:1 mapping, which means that for every source you create, Search creates one index.
A content item could be a URL, a PDF document, a Microsoft Word document, or a JSON object, for example. Search creates an index document from a content item as long as it can extract all required attributes. By default, the id and content_type attributes are required, but depending on your implementation, you can mark other attributes as required.
An index document exists within an index and represents a single content item. In other words, there is a 1:1 mapping between index documents and content items. For example, a 1000-word HTML page becomes one index document, and a 10-page PDF becomes one index document.
Index documents are JSON objects where the key:value pairs represent important information about a content item. As the crawler crawls, it creates an index document for each content item it encounters and adds the index document to the index of that source. The crawler uses the attribute extraction rules you configure to populate key:value pairs that describe the content item.
For example, the index document for your company's About us page might have the following key:value pairs:
id : id82723bBcjh9327
url : https://www.sitecore.com/company
image_url : https://wwwsitecorecom.db.net/images/3d-scene05-violet.png?
content_type : others
title : About us
description : We’re growing at a historic rate. 2,200 employees...
subtitle : Welcome to Sitecore. We help brands create human connections...
last_modified : 2022-09-13 17:54:34Z
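Expressed as a JSON object, the same index document might look like the following. This is an illustrative sketch built from the key:value pairs above; the internal layout of index documents is not exposed, so the exact field nesting and formatting here are assumptions.

```json
{
  "id": "id82723bBcjh9327",
  "url": "https://www.sitecore.com/company",
  "image_url": "https://wwwsitecorecom.db.net/images/3d-scene05-violet.png?",
  "content_type": "others",
  "title": "About us",
  "description": "We're growing at a historic rate. 2,200 employees...",
  "subtitle": "Welcome to Sitecore. We help brands create human connections...",
  "last_modified": "2022-09-13 17:54:34Z"
}
```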
You can view all your searchable content on the Content Collection page of the Customer Engagement Console (CEC). Note that when you index content for the first time, it takes a few minutes for the content collection to refresh after the crawler completes indexing.
Push and pull sources
Sitecore Search offers both pull and push sources.
A pull source uses a crawler to scan and index your source content. When you use a pull source, you configure rules that define what content you want to index and how you want to extract attributes for this content. Then, the crawler indexes content based on these rules.
A push source provides an API that you use to create index documents one by one. When you use a push source, you have complete control over all aspects of each index document you create. Instead of defining general rules that cover a large area of content, you are responsible for creating each index document and populating it with attribute values.
For most implementations, you can use pull sources to index the bulk of your content and then use a push source for one-off cases.
Search offers the following pull sources:
- Web crawler - a tool that crawls your content by starting from a point and following hyperlinks. For each hyperlink it comes across, the web crawler creates an index document. The web crawler is straightforward, easy to configure, and requires no coding.
- Advanced web crawler - a powerful and highly customizable crawler that crawls your content and adds it to an index. It supports complex use cases like handling authentication requirements, handling localized content, using JavaScript to extract attribute values (see the sketch after this list), and more.
- API crawler - a crawler specifically designed to crawl API endpoints that return JSON. It supports complex use cases like handling authentication requirements, handling localized content, using JSONPath or JavaScript to extract attribute values, and more.
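To give a sense of what JavaScript-based attribute extraction looks like, here is a minimal sketch of a document extractor for the advanced web crawler. It assumes the crawler hands the extractor a Cheerio-style $ selector through response.body and expects an array of attribute objects in return; confirm the exact extractor contract in the advanced web crawler documentation.

```typescript
// Minimal document extractor sketch: returns one index document per crawled page.
// Assumption: response.body is a Cheerio-style selector function ($).
function extract(request: unknown, response: { body: any }) {
  const $ = response.body;
  return [
    {
      // Populate the attributes defined in your domain, for example:
      title: $('title').text(),
      description: $('meta[name="description"]').attr('content'),
      subtitle: $('h2').first().text(),
    },
  ];
}
```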
Search offers the following push sources:
- Ingestion API - a RESTful API that you can use to create, update, or delete individual index documents (see the request sketch after this list).
  Note: You cannot use the Ingestion API on its own. You have to create a crawler source or an API push source so that you have an index that the Ingestion API can work with.
- API push source - a source that creates a placeholder index so that you can use the Ingestion API to add, update, or delete individual documents that are not covered by any other source, that is, not in any other index.
  Note: The API push source does not create index documents. After you create an API push source, you have to use the Ingestion API to add, update, or delete documents in the API push source's index.
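As a rough illustration of pushing a single document with the Ingestion API, the sketch below sends one index document to a source's index over HTTPS. The endpoint path, entity name, authorization header, and body shape are assumptions made for this example; take the real values from the Ingestion API reference and your CEC domain settings.

```typescript
// Sketch: create or update one index document through the Ingestion API.
// All URL segments, headers, and the body shape below are illustrative
// assumptions; verify them against the Ingestion API reference.
const domainId = 'YOUR_DOMAIN_ID';   // hypothetical placeholder
const sourceId = 'YOUR_SOURCE_ID';   // hypothetical placeholder
const documentId = 'about-us-page';  // hypothetical document ID

async function pushDocument(): Promise<void> {
  const response = await fetch(
    `https://discover.sitecorecloud.io/ingestion/v1/domains/${domainId}/sources/${sourceId}/entities/content/documents/${documentId}`,
    {
      method: 'PUT',
      headers: {
        'Content-Type': 'application/json',
        Authorization: 'YOUR_API_KEY', // hypothetical placeholder
      },
      body: JSON.stringify({
        document: {
          fields: {
            content_type: 'others',
            title: 'About us',
            description: 'We help brands create human connections...',
            url: 'https://www.sitecore.com/company',
          },
        },
      }),
    }
  );
  console.log(`Ingestion API responded with status ${response.status}`);
}
```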