Analyzers

Sitecore Search uses analyzers to ensure that searches return all relevant results instead of just exact matches. They are used to varying degrees in filters, personalization, suggestion blocks, textual relevance, and sorting options.

Analyzers convert text input into a structured format that is optimized for search. They do this using the following three-step process:

  1. If the analyzer uses character filters, it applies them. This means that certain characters are replaced or removed. For example, non-alphanumeric characters such as punctuation might be removed.

  2. The analyzer tokenizes the search phrase. This means that the phrase is split up into smaller chunks called tokens. These tokens are usually single words, but can also be partial words or phrases.

  3. If the analyzer uses token filters, it applies them. This means that the tokens are transformed using a variety of methods including applying synonyms, reducing tokens to their root words, removing stop words, and so on.
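The three-step process above can be sketched as a simple pipeline. This is an illustrative sketch only, not Sitecore Search internals: the function names, the character filter, and the stop-word list are all made up for the example.

```python
import re

STOP_WORDS = {"a", "the", "to"}  # made-up stop-word list

def char_filter(text):
    # Step 1: replace or remove certain characters
    # (here: remove punctuation and other non-alphanumeric characters).
    return re.sub(r"[^\w\s]", "", text)

def tokenize(text):
    # Step 2: split the phrase into smaller chunks called tokens
    # (here: single words, split on whitespace).
    return text.split()

def apply_token_filters(tokens):
    # Step 3: transform the tokens
    # (here: lowercase each token and drop stop words).
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS]

def analyze(text):
    return apply_token_filters(tokenize(char_filter(text)))
```

For example, `analyze("How to improve search results?")` returns `['how', 'improve', 'search', 'results']`.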

Note

For more information about analyzers, see the Elasticsearch documentation on the topic.

In the vast majority of cases, we recommend that you use the analyzers that are applied by default when you configure an attribute for use with a feature. However, there are a variety of analyzers available to you in Sitecore Search. These include basic analyzers and advanced analyzers.

Basic analyzers

Basic analyzers include the multi locale standard, standard, alphanumeric only, keyword, lowercase, and prefix match analyzers. You use these analyzers in the vast majority of use cases.

Multi locale standard

The multi locale standard analyzer processes input by making it lowercase, determining the root form of each word using stemming, applying synonyms, and removing stop words and punctuation. It takes locale into account while it does this.

For example, if a visitor searches for How can I improve search results?, the multi locale standard analyzer outputs the following tokens: how, can, i, improv, search, and result. In this example, the capital letters have been made lowercase, the words improve and results have been reduced to their root form, and the question mark has been removed.

The multi locale standard analyzer is able to work differently in different locales. For example, there are different stop words in different languages: here is a stop word in English, and aquí is the corresponding stop word in Spanish. In Spanish-speaking locales, the multi locale standard analyzer takes this difference into account.

We recommend you use the multi locale standard analyzer for textual relevance, even if your domain does not support multiple locales.
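The behavior described above can be approximated with a short sketch. This is illustrative only: the stop-word lists and stem table below are made up, and real stemming and synonym handling are far more sophisticated.

```python
import re

# Made-up, locale-keyed stop-word lists and a tiny stem table.
STOP_WORDS = {"en": {"to", "here"}, "es": {"aquí"}}
STEMS = {"improve": "improv", "results": "result"}

def multi_locale_standard(text, locale="en"):
    # Lowercase and drop punctuation.
    text = re.sub(r"[^\w\s]", "", text.lower())
    # Reduce each token to its root form where a stem is known.
    tokens = [STEMS.get(t, t) for t in text.split()]
    # Remove the stop words for the given locale.
    return [t for t in tokens if t not in STOP_WORDS.get(locale, set())]
```

With this sketch, `multi_locale_standard("How can I improve search results?")` returns `['how', 'can', 'i', 'improv', 'search', 'result']`, matching the example above, while a Spanish locale would instead drop aquí.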

Standard

The standard analyzer is an older, English-only version of the multi locale standard analyzer. It performs all of the same operations as multi locale standard but without taking locale into account.

You can use this analyzer for textual relevance if you only work with English-language data.

Alphanumeric only

The alphanumeric only analyzer performs all of the same transformations as the standard analyzer, but it also strips all non-alphanumeric characters instead of using them as token separators.

For example, in the case of a document ID 1235-abhe-3f34s, the alphanumeric only analyzer generates a single token: 1235abhe3f34s. This is useful when you want visitors to be able to search both with and without the hyphens. This result is different from the standard analyzer, which uses the hyphens to separate the ID into three tokens: 1235, abhe, and 3f34s.

This analyzer is often used for sorting or filtering.
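A minimal sketch of the stripping behavior, assuming whitespace tokenization and lowercasing as in the standard analyzer (the function name is illustrative):

```python
import re

def alphanumeric_only(text):
    # Tokenize on whitespace, then strip (rather than split on) every
    # character that is not a letter or a digit, and lowercase the result.
    tokens = (re.sub(r"[^0-9a-zA-Z]", "", t).lower() for t in text.split())
    return [t for t in tokens if t]
```

For example, `alphanumeric_only("1235-abhe-3f34s")` returns the single token `['1235abhe3f34s']`, whereas splitting on the hyphens would have produced three tokens.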

Keyword

The keyword analyzer generates the input text as a single token.

For example, if a visitor searches for Sitecore Search, the keyword analyzer creates a single token: Sitecore Search. This means that searching for sitecore or search individually does not produce a match, because matches require the full, exact text.

This analyzer is useful for filters or other special cases where you need an exact match.
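As a sketch, the keyword analyzer amounts to passing the input through untouched (the function name is illustrative):

```python
def keyword(text):
    # The entire input is emitted, unchanged, as a single token.
    return [text]
```

Because the token is the whole, unmodified input, only an exact, full-text query matches it: `keyword("Sitecore Search")` returns `['Sitecore Search']`, which `sitecore` alone never equals.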

Lowercase

The lowercase analyzer produces a single output token with the whole input in lowercase form.

For example, if a visitor searches for How to create Search experiences, the lowercase analyzer generates the following token: how to create search experiences.

This analyzer is often used for sorting or filtering.
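A sketch of this behavior, with an illustrative function name:

```python
def lowercase(text):
    # The whole input becomes one token, converted to lowercase.
    return [text.lower()]
```

For example, `lowercase("How to create Search experiences")` returns `['how to create search experiences']`.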

Prefix match

The prefix match analyzer generates lowercase prefixes with lengths ranging from 3 to 15 characters, stripping all non-alphanumeric characters from the input.

For example, if a visitor searches for the ISBN 978-3-16-148410-0, the tokens generated include 978, 9783, 97831, 978316, and so on.

This analyzer is often used in textual relevance for matching unique IDs.
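A sketch of the prefix generation, using the 3-to-15 character range stated above (the function name is illustrative):

```python
import re

def prefix_match(text, min_len=3, max_len=15):
    # Strip non-alphanumeric characters and lowercase the input,
    # then emit every prefix from min_len up to max_len characters
    # (capped at the length of the stripped input).
    stripped = re.sub(r"[^0-9a-zA-Z]", "", text).lower()
    upper = min(max_len, len(stripped))
    return [stripped[:n] for n in range(min_len, upper + 1)]
```

For the ISBN 978-3-16-148410-0, the stripped input is 9783161484100, and the first tokens are 978, 9783, 97831, 978316, and so on.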

Advanced analyzers

For a minority of use cases, you might need to use one of the following analyzers: ngram based matching, partial match, shingle generator, or standard no stemmer.

Ngram based matching

The Ngram based matching analyzer breaks text into words, then creates character n-grams of length n for each word.

For example, if a visitor searches for Sitecore Search and the value of n is 2, the tokens generated include Si, it, te, ec, co, or, re, Se, ea, and so on.

This analyzer is useful for querying languages that don’t use spaces, like Japanese, and languages that have long compound words, like German. It is also useful when working with prefixes. Ngram based matching is often used in suggestion blocks.
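The n-gram generation can be sketched as follows, assuming whitespace word-splitting (the function name and default value of n are illustrative):

```python
def char_ngrams(text, n=2):
    # Split the text into words, then emit every contiguous
    # n-character slice of each word.
    grams = []
    for word in text.split():
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams
```

For Sitecore Search with n = 2, this yields Si, it, te, ec, co, or, re from the first word, followed by Se, ea, ar, rc, ch from the second.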

Partial match

The partial match analyzer generates lowercase variants of the input tokens, both splitting and joining on special characters and removing stop words.

For example, if a visitor searches for How do I keep Sitecore Search results up-to-date, it generates the following tokens: how, do, i, keep, sitecore, search, results, up, date, and uptodate. In this example, all of the words are converted to lowercase, and the hyphenated word up-to-date is split into separate tokens (while removing the stop word to) and joined into a single token: uptodate.
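The split-and-join behavior can be sketched as follows. This is illustrative only: the stop-word list is made up, and the function name is not a Sitecore Search API.

```python
import re

STOP_WORDS = {"to", "a", "the"}  # made-up stop-word list

def partial_match(text):
    tokens = []
    for word in text.lower().split():
        if re.search(r"[^0-9a-z]", word):
            parts = [p for p in re.split(r"[^0-9a-z]+", word) if p]
            # Split variant: the parts, with stop words removed.
            tokens.extend(p for p in parts if p not in STOP_WORDS)
            # Joined variant: the parts fused with the separators removed.
            tokens.append("".join(parts))
        elif word not in STOP_WORDS:
            tokens.append(word)
    return tokens
```

With this sketch, up-to-date yields the split tokens up and date (the stop word to is dropped) plus the joined token uptodate.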

Shingle generator

The shingle generator analyzer works by creating word-level n-grams called shingles.

For example, if a visitor searches for How to improve search results and the analyzer is configured to create 2-word long shingles, the following tokens are generated: How to, to improve, improve search, and search results.

This analyzer is useful for extracting partial data and matching against it. The shingle generator analyzer is often used in suggestion blocks.
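Shingle creation can be sketched as a sliding window over the words (the function name and default shingle size are illustrative):

```python
def shingles(text, size=2):
    # Emit every run of `size` consecutive words as one token.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]
```

For example, `shingles("How to improve search results")` returns `['How to', 'to improve', 'improve search', 'search results']`.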

Standard no stemmer

The standard no stemmer analyzer performs the same operations as the standard analyzer but without reducing tokens to their root form using stemming.

For example, if a visitor searches for How to improve search results?, the standard no stemmer analyzer generates the following tokens: how, improve, search, and results. The words are converted to lowercase, the stop word to is removed, and the question mark is removed. In contrast to the standard analyzer, the word improve is not changed to its root form, improv.
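As a sketch, this is the standard pipeline with the stemming step left out (the stop-word list is made up and the function name is illustrative):

```python
import re

STOP_WORDS = {"to"}  # made-up stop-word list

def standard_no_stemmer(text):
    # Lowercase, strip punctuation, split on whitespace, and drop
    # stop words -- but do not reduce tokens to their root form.
    text = re.sub(r"[^\w\s]", "", text.lower())
    return [t for t in text.split() if t not in STOP_WORDS]
```

For example, `standard_no_stemmer("How to improve search results?")` returns `['how', 'improve', 'search', 'results']`, with improve left unstemmed.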
