Configuring document extractors

The document extractor creates an index document out of the URLs or documents in your original content. Sitecore Search then adds these index documents to the source's index. At run time, Search looks through index documents to get results.

The document extractor adds attributes and attribute values in the form of key:value pairs to each index document. When you configure a document extractor, you specify how you want Sitecore Search to extract values for each attribute. If the document extractor cannot find an attribute value, it does not add that key:value pair to the index document for that content item. If the document extractor cannot find the value for a mandatory attribute, Sitecore Search does not index that content item.

Note

For the web crawler, the document extractor is called Attribute Extraction.

Note

All non-JavaScript extractors stop after the first match they find for each attribute. JavaScript extractors execute in their entirety.

For example, you want to crawl a site with 100 URLs and you configure the crawler to get the content_type, title, description, and image_url attributes. You get 100 index documents, each with some or all of the following key:value pairs.

content_type:<content_type_value>
title:<title_value>
description:<description_value>
image_url:<image_url_value>
id:<id_value>

Later, when you integrate your website with Search, you can create the following search experiences:

Show a results page with a title, description, and image for each item.
Allow users to filter by content_type, for example, to show only news or blog items.
Allow users to arrange results in ascending or descending order of the title.

Unless you extract a value for an attribute, you can't use that attribute to create a search experience. In the previous example, you won't be able to show users an average_rating for each content piece even if your original content has a field for this and you have configured an average_rating attribute. This is because you did not configure the crawler to extract the average_rating attribute.

Except for the web crawler, you can create more than one document extractor for a source. Usually, you would create one document extractor for each content section that needs different URL matching rules and attribute extraction rules. For example, you have a banking website with three sections - personal, loans, and commercial, each with a different URL and metadata pattern. One way to handle this is to create one document extractor for mybank.com/personal, one for mybank.com/loans, and one for mybank.com/commercial.

Mandatory attributes

Sitecore Search has to be able to extract all mandatory attributes before it creates an index document out of a URL or document. If it cannot extract the value of any mandatory attribute, it does not index that URL or document.

There are two types of mandatory attributes: attributes that are mandatory for all domains and attributes that are mandatory only for your domain.

All domains have the following mandatory attributes:

url - you have to configure how to extract the URL for each page or document.
type - you have to configure how to extract the type for each page or document.
id - Search generates and assigns a unique ID for each index document it creates. You can, however, extract and assign a value.

Note

There are two scenarios where you have to explicitly configure how to extract id: if you have localized content and if you generate more than one index document from a single URL or document. In both scenarios, ensure that all index documents of a content piece share the same ID. This is important to ensure that rules and other settings that use the index document ID apply to all index documents.

For example, your company's About Us page is available in six locales, including English (US). If you don't configure how to extract the id attribute, Search generates six different IDs for the six About Us index documents. This creates a problem when you configure anything that uses the ID of an index document. For example, many pin rules are based on pinning a content item with a specific ID to a specific slot. If you want to pin About Us, and you use the ID of the English (US) version, only users in the English (US) locale will see that content item pinned. Users in the other five locales will not see a localized version of About Us pinned. To avoid this problem, always ensure that localized versions of the same content have the same ID.
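One way to picture a shared ID for localized content is a small helper that removes the locale from the URL before building the ID. The following is a minimal sketch, not Sitecore Search code. The function name sharedIdFromUrl, the example.com URLs, the assumption that the locale is the first path segment, and the underscore-based ID format are all hypothetical choices made for illustration.

```javascript
// Hypothetical helper: derive one ID for all localized versions of a page.
// Assumes the locale is the first path segment, for example /en-us/about-us.
function sharedIdFromUrl(url) {
    const { hostname, pathname } = new URL(url);
    // Drop a leading locale segment such as /en-us or /fr.
    const path = pathname.replace(/^\/[a-z]{2}(-[a-z]{2})?(?=\/|$)/i, '');
    // Normalize to a simple slug; this ID format is a choice made for the sketch.
    const slug = (hostname + path).replace(/[^a-zA-Z0-9]+/g, '_');
    return slug.replace(/^_+|_+$/g, '');
}

// Both localized URLs produce the same ID.
console.log(sharedIdFromUrl('https://www.example.com/en-us/about-us'));
console.log(sharedIdFromUrl('https://www.example.com/fr-fr/about-us'));
```

In a JavaScript document extractor, you could return the result of a helper like this as the id value, so that a pin rule created for one locale applies to every localized version.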

Important

Do not assign the same ID to documents across different entities because it can lead to conflicts in rules and content behavior. To preserve referential integrity, ensure that each index document has a unique ID within its entity.

In addition to these attributes, your domain might have other attributes configured as mandatory. You'll need to configure how to extract them, too.
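The mandatory attribute rule can be modeled as a simple filter. The following sketch is an illustration of the behavior described in this section, not the way Search is implemented. The names MANDATORY_ATTRIBUTES, hasValue, and toIndexDocument are hypothetical, and id is left out of the list because Search generates an ID when you do not extract one.

```javascript
// Illustrative model only; this is not Sitecore Search code.
// 'id' is left out because Search generates one when none is extracted.
const MANDATORY_ATTRIBUTES = ['url', 'type'];

function hasValue(value) {
    return value !== undefined && value !== null && value !== '';
}

// Returns the index document, or null when the item would not be indexed.
function toIndexDocument(extracted, mandatory = MANDATORY_ATTRIBUTES) {
    const missing = mandatory.filter((name) => !hasValue(extracted[name]));
    if (missing.length > 0) {
        return null; // a mandatory attribute has no value: skip the item
    }
    // Keep only the key:value pairs for which a value was found.
    return Object.fromEntries(
        Object.entries(extracted).filter(([, value]) => hasValue(value))
    );
}

const indexed = toIndexDocument({
    url: 'https://www.example.com/news/1',
    type: 'news',
    title: 'Quarterly update',
    description: undefined
});
const skipped = toIndexDocument({
    url: 'https://www.example.com/news/2',
    title: 'No type'
});
console.log(indexed); // description is omitted because no value was found
console.log(skipped); // null, because the mandatory type attribute is missing
```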

URLs to match

Note

You can use the URLs to Match feature with the advanced web crawler and the API crawler sources.

To ensure that the document extractor only extracts pages or documents whose URL matches a defined pattern, configure URLs to Match.

You can use a regular expression, a Glob expression, or a JavaScript function to define a URL pattern.

A common use case for this setting is to create different extractors for different areas of your source content. For example, you can create a document extractor for your HTML pages and a document extractor for your PDF content.

In case of redirects, configure URLs to Match to include the redirected URLs.
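To see how URL patterns can divide content between extractors, consider the HTML and PDF use case expressed as regular expressions. This is a standalone sketch that you can run outside Search to test patterns before you enter them. The extractor names, the example.com host, and the extractorFor helper are hypothetical, and the two patterns are written so that they never match the same URL.

```javascript
// Hypothetical patterns for two document extractors on the same source.
const extractors = [
    // Any URL that ends in .pdf.
    { name: 'pdf-content', pattern: /\.pdf$/i },
    // Any URL on the site that does not end in .pdf.
    { name: 'html-pages', pattern: /^https:\/\/www\.example\.com\/(?!.*\.pdf$)/i }
];

// Returns the name of the extractor whose pattern matches, or null.
function extractorFor(url) {
    const match = extractors.find((extractor) => extractor.pattern.test(url));
    return match ? match.name : null;
}

console.log(extractorFor('https://www.example.com/docs/rates.pdf'));
console.log(extractorFor('https://www.example.com/personal/accounts'));
console.log(extractorFor('https://other.example.org/page'));
```

If your site redirects, for example from an old path to a new one, test the redirected URL against the pattern as well.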

Using a Fixed value

Note

You can specify a fixed value for an attribute in all crawler sources.

If you want an attribute to have a constant value across all documents indexed by a source, you can specify a fixed value for this attribute. This means that all index documents have the same predefined value for this attribute. The crawler does not extract this attribute value from the page metadata.

This method is useful when you need an attribute to have the same value across all index documents, but the attribute value is not available in your content metadata, is available on only some pages, or has different values on different pages.

A common use case is to specify a fixed value for the type attribute. For example, you have two sources that index content from a banking website. One source indexes content from the commercial banking section, and the other source indexes content from the consumer banking section. One way to handle the type attribute is to specify a fixed value of commercial in the first source and consumer in the second source.

Using an HTML meta tag or OG tag to extract attribute values

To use an HTML meta tag or an OG tag to extract attribute values, all you have to do is give Search the name of the meta tag or OG tag whose content you want to use as the attribute value. Then, Search internally constructs an XPath expression that evaluates to the attribute value.

Note

You can use the HTML meta tag or OG tag extraction method with a web crawler source.

We recommend that you use this method when attribute values are in your page header, and Search can access them using a simple XPath expression.

When you enter an HTML meta tag or OG tag, you do not see the XPath expression because Search constructs it internally. However, it is helpful to know which expression Search creates so that you can evaluate whether this extraction method works for you.

The following sections explain how Search constructs XPath expressions from the meta tag or OG tag that you enter.

Example using an HTML meta tag

First, Search constructs an XPath expression that assumes you have entered a meta tag name. If that expression does not yield any result, Search constructs an XPath expression that assumes you have entered a property name.

For example, if you enter author, Search first constructs this XPath expression: //meta[@name='author']/@content. If this expression gives a value, Search stops. If this expression does not give a value, Search then constructs this XPath expression: //meta[@property='author']/@content.

Example using an OG tag

Assume you enter og:site_name. Because this value contains og, Search assumes that you entered an OG tag and constructs the following XPath expression:

//meta[@property='og:site_name']/@content
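The lookup order described in these two examples can be summarized in a few lines of JavaScript. This is a sketch of the documented behavior, not the code Search runs. The function name candidateXPaths is hypothetical, and the sketch assumes that an OG tag is recognized by its og: prefix.

```javascript
// Sketch of the lookup order described above; not Sitecore Search's actual code.
function candidateXPaths(tagName) {
    const byProperty = `//meta[@property='${tagName}']/@content`;
    // Assumption: an OG tag is recognized by its "og:" prefix.
    if (tagName.toLowerCase().startsWith('og:')) {
        return [byProperty];
    }
    // Plain meta tag: try the name attribute first, then the property attribute.
    return [`//meta[@name='${tagName}']/@content`, byProperty];
}

console.log(candidateXPaths('author'));
console.log(candidateXPaths('og:site_name'));
```

The expressions are listed in the order in which they are tried. You can run each one against your page source, for example in a browser's developer tools, to check which one returns a value.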

Using an XPath document extractor

Use an XPath document extractor when you want to use an XPath expression that evaluates to an attribute value.

Note

You can use an XPath document extractor with the web crawler and advanced web crawler sources. However, for a web crawler, the only setting you need to configure is the XPath expression to extract the attribute value.

Use the following settings to extract attributes using XPath:

Settings	Description
Tag	The entity-based tag you defined in Source Settings > Tags Definitions.
Localized	Whether this document is available in more than one locale. You see this attribute only if your domain has more than one locale.
Attribute	The attribute for which you want to extract a value. Select from the attributes defined during domain setup in the Administration > Domain Settings section.
Value type	The type of rule you want to configure. You can use: Fixed, to add a fixed value to all documents with this attribute; or Expressions, to enter an expression that extracts the attribute value. Note: This setting does not apply to the web crawler.
Selectors	The rules Sitecore Search uses when it adds attribute tags to documents. You see this if you select Expressions in the Value type drop-down menu. Add a CSS or XPath EXPRESSION and a DEFAULT TO value. You can define more than one selector for an attribute. This is useful when you have different content arrangements in your source content. For example, the value of the title could be in an `<h1>` HTML tag for some content items, but in an `og:title` meta tag in other content items. When you have multiple selectors, Search starts at the top and runs selectors until a selector returns an attribute value. Note: This setting does not apply to the web crawler.

Sample XPath extraction attributes

When you select an XPath document extractor, Search adds some sample attributes and corresponding XPath extraction expressions to help with configuration. You can edit or delete the configuration of any sample attribute.

Search adds the following sample attributes and XPath expressions by default:

type - //meta[@property='og:type']/@content
url - //meta[@property='og:url']/@content
image_url - //meta[@property='og:image']/@content
name - //meta[@property='og:title']/@content
description - //meta[@property='og:description']/@content

Using a JavaScript document extractor

To extract attributes using JavaScript, add a function that defines how the crawler should get values for each attribute. Usually, you use a JavaScript document extractor in scenarios where an XPath expression is not enough to extract an attribute's value.

Note

You can use a JavaScript document extractor with the advanced web crawler and the API crawler sources.

For example, if you want to replace parts of a URL that you get from the page metadata before storing it as an attribute, you must use a JavaScript extractor. You can also use a JavaScript function to create composite attributes, that is, create a new attribute by combining values from more than one existing attribute.

The JavaScript function you define must:

Use Cheerio syntax. That is, it must use this format:
function extract(request, response) { $ = response.body; }
Return an array of objects.

To help with configuration, Search adds the following sample JavaScript function that gets values for the description, name, type, and URL attributes from meta tags:

function extract(request, response) {
    $ = response.body;

    return [{
        'description': $('meta[name="description"]').attr('content') || $('meta[property="og:description"]').attr('content') || $('p').text(),
        'name': $('meta[name="searchtitle"]').attr('content') || $('title').text(),
        'type': $('meta[property="og:type"]').attr('content') || 'website_content',
        'url': $('meta[property="og:url"]').attr('content')
    }];
}

You can edit this function or paste a new function.
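As an illustration of the two use cases mentioned earlier, rewriting part of a URL and building a composite attribute, a function might look like the following. Treat it as a sketch under stated assumptions rather than a ready-made configuration. The staging.example.com and www.example.com hosts, the section meta tag, and the section_title attribute are hypothetical, and the fallback to request.url assumes that the request object exposes the crawled URL. The sketch declares $ with const, a small deviation from the sample above, so that it also runs in strict mode.

```javascript
function extract(request, response) {
    const $ = response.body;

    const title = $('title').text().trim();
    // Hypothetical meta tag; replace it with one that exists in your content.
    const section = $('meta[name="section"]').attr('content') || 'general';
    // Assumption: request.url holds the crawled URL and can serve as a fallback.
    const rawUrl = $('meta[property="og:url"]').attr('content') || request.url || '';

    return [{
        // Rewrite part of the URL before storing it (staging host to live host).
        'url': rawUrl.replace('staging.example.com', 'www.example.com'),
        'name': title,
        'type': 'website_content',
        // Composite attribute built from two extracted values.
        'section_title': section + ': ' + title
    }];
}
```

The function returns an array with a single object, so each crawled page yields one index document. Any attribute that you return, such as section_title here, must also be defined in your domain before you can use it in a search experience.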

See the topic on how to configure a JavaScript extractor for a more complex JavaScript function.

Using a JSONPath document extractor

Use a JSONPath document extractor to extract attribute values from JSON data.

Note

You can use a JSONPath document extractor with the API crawler source.

Use the following settings to extract attributes using JSONPath:

Settings	Description
Tag	The entity-based tag you defined in Source Settings > Tags Definitions.
Localized	Whether this document is available in more than one locale. You see this attribute only if your domain has more than one locale.
Attribute	The attribute for which you want to extract a value. Select from the attributes defined during domain setup in the Administration > Domain Settings section. For example, select the title attribute to define rules to extract the title for each index document.
Value type	The type of rule you want to configure. You can use: Fixed - to add a fixed value to all documents with this attribute. Expressions - to enter a JSONPath expression in Selectors.
Selectors	The rules Sitecore Search uses when it adds attribute tags to documents. You see this if you select Expressions in the Value type drop-down menu. Add a JSONPath EXPRESSION and, optionally, a DEFAULT TO value. You can define more than one selector for an attribute. This is useful when you have different content arrangements in your source content. When you have multiple selectors, Search starts at the top and runs selectors until a selector returns an attribute value. For example, you want to extract the description attribute. For some content items, you might need to use this JSONPath expression: `..placeholders['headless-main']..fields.Description.value`. For other content items, you might need to use this JSONPath expression: `..placeholders['headless-main']..fields.Text.value`.
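To make the selector order concrete, the following sketch reproduces the two JSONPath expressions from the Selectors example in plain JavaScript and tries them from top to bottom. It is a hand-written approximation for illustration, not a JSONPath library and not Search's implementation. The helper names, the sample JSON shape, and the choice to return the first matching value are assumptions.

```javascript
// Collect every value stored under `key` at any depth (the ".." operator).
function descendants(node, key) {
    if (node === null || typeof node !== 'object') return [];
    const found = [];
    for (const [k, v] of Object.entries(node)) {
        if (k === key) found.push(v);
        found.push(...descendants(v, key));
    }
    return found;
}

// Hand-written equivalent of ..placeholders['headless-main']..fields.<field>.value
function fieldValues(data, field) {
    return descendants(data, 'placeholders')
        .map((placeholders) => placeholders?.['headless-main'])
        .flatMap((main) => descendants(main, 'fields'))
        .map((fields) => fields?.[field]?.value)
        .filter((value) => value !== undefined && value !== null);
}

// Try each field in order and stop at the first one that returns a value.
function firstValue(data, fieldNames, defaultTo) {
    for (const field of fieldNames) {
        const values = fieldValues(data, field);
        if (values.length > 0) return values[0];
    }
    return defaultTo;
}

// Hypothetical API responses with different content arrangements.
const pageA = { sitecore: { route: { placeholders: { 'headless-main': [
    { fields: { Description: { value: 'Savings accounts overview' } } }
] } } } };
const pageB = { sitecore: { route: { placeholders: { 'headless-main': [
    { fields: { Text: { value: 'Loan rates and terms' } } }
] } } } };

console.log(firstValue(pageA, ['Description', 'Text'], ''));
console.log(firstValue(pageB, ['Description', 'Text'], ''));
```

For pageB, the Description selector returns nothing, so the sketch falls through to the Text selector. The last argument plays the role of the DEFAULT TO value when no selector matches.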

Using a CSS document extractor

Use a CSS document extractor when you want to use CSS queries to extract attribute values. Unlike HTML tags that you can extract with XPath, CSS tags do not appear in the page source. For example, titles and images are sometimes within CSS tags.

Note

You can use a CSS document extractor with the advanced web crawler source.

Use the following settings to use a CSS expression to extract attributes:

Settings	Description
Tag	The entity-based tag you defined in Source Settings > Tags Definitions.
Localized	Whether this document is available in more than one locale. You see this attribute only if your domain has more than one locale.
Attribute	The attribute for which you want to extract a value. Select from the attributes defined during domain setup in the Administration > Domain Settings section.
Value type	The type of rule you want to configure. You can use: Fixed - to add a fixed value to all documents with this attribute. Expressions - to enter an expression to extract the attribute value.
Selectors	The rules Sitecore Search uses when it adds attribute tags to documents. You see this if you select Expressions in the Value type drop-down menu. Add a CSS EXPRESSION and a DEFAULT TO value. You can define more than one selector for an attribute. This is useful when you have different content arrangements in your source content.
