Index items
An index is a data structure that represents searchable items associated with a specific source. Examples of items include products, URLs, PDF documents, and JSON objects. An index document represents a single item in your implementation as a JSON object. For example, a 1,000-word HTML page would be mapped to one index document, and a 10-page PDF would be mapped to another index document.
Indexing is important in Sitecore Search because when a visitor makes a search query, the AI/ML engine looks in the index to identify content that matches the query. Similarly, recommendations are also based on indexes.
The topics in this section describe an index document and the types of sources you can use for your indexing needs.
Consider your entities and attributes and the intended search experiences while configuring your sources, because the configurations:
- Determine which content is searched to return results.
- Contain information about content types, location, accessibility, and scope.
Index documents
An index document contains the information used to generate results, such as system attributes and an rfk_source array. The array contains JSON objects for the item, with properties that match the attributes of the item's entity type.
The following code block shows the index document for a company's About us webpage:
{
"_index": "sampleindexnumber12345",
"_id": "internalID123",
"_version": 1,
"_seq_no": 10,
"_primary_term": 1,
"found": true,
"rfk_source": [
{
"content": {
"id": "id83723bBnjh9321",
"url": "https://www.sitecore.com/company",
"image_url": "https://wwwsitecorecom.db.net/images/3d-scene05-violet.png?",
"content_type": "others",
"title": "About us",
"description": "We’re growing at a historic rate. 2,200 employees...",
"subtitle": "Welcome to Sitecore. We help brands create human connections...",
"last_modified": "2023-09-13 17:54:34"
},
"_entity": "content"
}
],
"questions_and_answers": {...},
"rfk_stats": {...},
...
}
//sample index document for illustration only
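As a rough illustration of how the entity attributes sit inside an index document, the following Python sketch parses a trimmed-down document and reads the attributes from rfk_source. The document here is a hypothetical reduction of the sample, with the rfk_source entry written as an object so that it parses as valid JSON. The helper name entity_attributes and the assumption that attributes sit under a key named after the entity are illustrative, not part of any Sitecore API.

```python
import json

# Hypothetical, trimmed-down index document modeled on the sample.
# The rfk_source entry is written as an object so that it is valid JSON.
SAMPLE = """
{
  "_id": "internalID123",
  "rfk_source": [
    {
      "content": {
        "id": "id83723bBnjh9321",
        "url": "https://www.sitecore.com/company",
        "title": "About us",
        "last_modified": "2023-09-13 17:54:34"
      },
      "_entity": "content"
    }
  ]
}
"""


def entity_attributes(index_document: dict) -> list:
    """Return (entity, attributes) pairs found in rfk_source."""
    pairs = []
    for entry in index_document.get("rfk_source", []):
        entity = entry.get("_entity")
        # Assumption: the attributes sit under a key named after the entity.
        attributes = entry.get(entity, {})
        pairs.append((entity, attributes))
    return pairs


document = json.loads(SAMPLE)
for entity, attributes in entity_attributes(document):
    print(entity, attributes["title"], attributes["url"])
```

The sketch only reads the document. It makes no claim about how Search itself stores or queries index documents.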
You can view all your indexed items on the Content Collection page. They are sorted by entity into different collections and conform to the entity's attribute and feature configuration.
After a successful indexing run, it takes a few minutes for items to appear or to be refreshed in the collection.
Types of sources
A source is a mechanism that Search uses to index searchable items for visitors. Typical implementations have multiple sources to access content in different locations and in different formats, such as HTML pages, PDFs, and JSON objects.
Sitecore Search offers two types of mechanisms for indexing content: pull sources and push sources. The type of source you select depends on how the items can be accessed, how often they change, and how much they change.
Pull sources
With a pull source, Sitecore Search provides a crawler to scan and index your content, but you define rules that determine which content gets indexed and how to extract attributes. You can think of indexing content using a pull source as a collaborative effort between you and Search.
To make incremental updates with a pull source, a developer can use the Ingestion API.
Search offers the following pull sources:
- Web crawler - a basic crawler that crawls your content by starting from a point and following hyperlinks or by going through a sitemap. The web crawler is straightforward, easy to configure, and requires no coding.
- Advanced web crawler - a powerful and highly customizable crawler that supports complex use cases such as handling authentication requirements, handling localized content, and using JavaScript to extract attribute values.
- API crawler - a crawler specifically designed for API endpoints that return JSON. It supports complex use cases such as handling authentication requirements, handling localized content, and using JSONPath or JavaScript to extract attribute values.
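To make "starting from a point and following hyperlinks" concrete, here is a minimal Python sketch of a breadth-first crawl. It is a generic illustration of the technique, not Sitecore's crawler. The PAGES dictionary is a hypothetical in-memory site that stands in for real HTTP responses, and the allowed_prefix parameter is an assumed stand-in for a rule that limits which content gets indexed.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical in-memory site standing in for real HTTP responses.
PAGES = {
    "https://example.com/": '<a href="/about">About</a> <a href="/products">Products</a>',
    "https://example.com/about": '<a href="/">Home</a>',
    "https://example.com/products": '<a href="/about">About</a> '
    '<a href="https://other.example/">Partner</a>',
}


def crawl(start_url, fetch, allowed_prefix):
    """Breadth-first crawl from start_url; returns URLs in visit order."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page could not be fetched; skip it
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            # The prefix check plays the role of a rule limiting crawl scope.
            if absolute.startswith(allowed_prefix) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


order = crawl("https://example.com/", PAGES.get, "https://example.com/")
print(order)
```

The external link to other.example is discovered but never visited, because it falls outside the allowed prefix.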
Push sources
Search offers a single type of push source: the API push source, which creates an index to receive index documents.
To make incremental updates with a push source, a developer must use the Ingestion API via an API push connector.
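An incremental update only needs to send the items that are new or have changed since the last indexing run. The following Python sketch shows one possible way to select those items by comparing content fingerprints. The function names, the document shape, and the fingerprinting approach are assumptions for illustration; the actual Ingestion API request is deliberately omitted, so refer to the API reference for the real endpoints and payloads.

```python
import hashlib
import json


def fingerprint(document: dict) -> str:
    """Stable hash of a document's content (keys sorted for determinism)."""
    payload = json.dumps(document, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def changed_documents(previous: dict, current: list) -> list:
    """Return the documents that are new or differ from the stored fingerprints.

    previous maps a document id to the fingerprint recorded at the last run.
    """
    return [doc for doc in current if previous.get(doc["id"]) != fingerprint(doc)]


# Hypothetical items as they looked at the last indexing run.
old_items = [
    {"id": "a", "title": "About us"},
    {"id": "b", "title": "Careers"},
]
previous = {doc["id"]: fingerprint(doc) for doc in old_items}

# Hypothetical items as they look now.
new_items = [
    {"id": "a", "title": "About us"},         # unchanged
    {"id": "b", "title": "Careers at Acme"},  # changed
    {"id": "c", "title": "Contact"},          # new
]

to_send = changed_documents(previous, new_items)
print([doc["id"] for doc in to_send])
```

Only the documents in to_send would then be passed to whatever code calls the Ingestion API. Unchanged items are left alone.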