
Index content of media files


The Sitecore Content Search API uses the following open-source libraries to extract text content from media files for indexing:

PDFSharp for PDF files
Open XML SDK for DOCX, XLSX and PPTX files

Note

You can also use Apache Tika to extract media content for indexing.

Configure media indexing

The default text extractor supports only the following file formats: .pdf, .docx, .xlsx, and .pptx. If you need to index more formats, consider using one of the alternative text extractors: Solr Cell, Tika, or IFilter.

When you use one of these alternative text extractors, Sitecore uses the configuration in the <mediaIndexing> section and tries to index content with these extensions: rtf, odt, doc, dot, docx, dotx, docm, dotm, xls, xlt, xla, xlsx, xlsm, xltm, xlam, xlsb, ppt, pot, pps, ppa, pptx, potx, ppsx, ppam, pptm, potm, and ppsm, or with these MIME types: application/pdf, text/html, and text/plain. Whether extraction succeeds depends on the IFilters installed on the system, or on the file formats that Solr Cell or Tika supports.

You cannot use the IFilter option if the application is deployed as an Azure Web App.

If you want to index a different set of file types, you can specify the file types by patching the mediaIndexing configuration node for the search provider you use. For Solr, the default configuration is in the App_Config\Sitecore\ContentSearch\Sitecore.ContentSearch.Solr.DefaultIndexConfiguration.config file.

There are two nodes under the mediaIndexing node:

mimeTypes - this node specifies the MIME types that are included (indexed) or excluded (not indexed).
extensions - this node specifies the file extensions of files that are included (indexed) or excluded (not indexed).

Each of these nodes has two child nodes, includes and excludes, which specify what to index and what not to index. You can use an asterisk (*) as a wildcard in either child node, which lets you implement whitelisting and blacklisting. To whitelist, add the wildcard to the excludes node, and then list the permitted extensions or MIME types in the includes node. To blacklist, add the wildcard to the includes node, and then list the blocked extensions or MIME types in the excludes node.
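
To illustrate the whitelisting approach described above, the following fragment is a minimal sketch of a mediaIndexing node that whitelists the application/pdf MIME type and the pdf and docx extensions. The mediaIndexing, mimeTypes, extensions, includes, and excludes names come from the description above. The mimeType and extension child element names are assumptions, and the fragment omits any attributes that the shipped node may carry, so compare it with the default configuration file for your search provider before you patch.

```xml
<!-- Sketch only: the mimeType and extension child element names are assumed. -->
<mediaIndexing>
  <mimeTypes>
    <includes>
      <!-- MIME types to index -->
      <mimeType>application/pdf</mimeType>
    </includes>
    <excludes>
      <!-- Wildcard: exclude every MIME type that is not listed in includes -->
      <mimeType>*</mimeType>
    </excludes>
  </mimeTypes>
  <extensions>
    <includes>
      <!-- File extensions to index -->
      <extension>pdf</extension>
      <extension>docx</extension>
    </includes>
    <excludes>
      <!-- Wildcard: exclude every extension that is not listed in includes -->
      <extension>*</extension>
    </excludes>
  </extensions>
</mediaIndexing>
```

For blacklisting, the wildcard entries would move to the includes nodes and the specific entries to the excludes nodes.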

Configure content search to use IFilter for indexing

You can set up the Sitecore Content Search API to use the native Microsoft Windows IFilter interface.

A PDF IFilter is not installed by default. To index the content of PDF files, identify each Sitecore instance that performs indexing (with the Solr provider, this is usually a CM instance) and install a PDF IFilter on every machine that hosts such an instance.

To enable IFilter to extract media file content for indexing:

Modify (patch) App_Config\Sitecore\ContentSearch\Sitecore.ContentSearch.ContentExtraction.config as follows:
<mediaFileTextExtractor type="Sitecore.ContentSearch.ContentExtraction.IFilter.IFilterMediaFileTextExtractor, Sitecore.ContentSearch.ContentExtraction"/>
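
One way to apply this change without editing the shipped file is a separate patch file. The following is a minimal sketch under stated assumptions, not a definitive implementation: the file name is hypothetical, and the sketch assumes that the mediaFileTextExtractor element sits directly under sitecore/contentSearch. Check the element's actual path in Sitecore.ContentSearch.ContentExtraction.config and mirror it in the patch.

```xml
<!-- Hypothetical patch file name: UseIFilterExtraction.config -->
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <contentSearch>
      <!-- Assumed location of the element; adjust it to match the shipped config. -->
      <mediaFileTextExtractor>
        <!-- Replace the type attribute so that the IFilter extractor is used. -->
        <patch:attribute name="type">Sitecore.ContentSearch.ContentExtraction.IFilter.IFilterMediaFileTextExtractor, Sitecore.ContentSearch.ContentExtraction</patch:attribute>
      </mediaFileTextExtractor>
    </contentSearch>
  </sitecore>
</configuration>
```

Patching only the type attribute leaves the rest of the shipped element untouched, which tends to make the change easier to review and to revert.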

Limitations of PDF content extraction

Sitecore uses the PDFSharp library to extract text from PDF files for indexing. Sitecore only opens PDFs in read-only mode using PDFSharp, and extracts their text content. However, certain PDF configurations prevent successful extraction.

Supported PDFs

Sitecore can index the following types of PDF files:

Unencrypted PDFs - fully supported.
RC4-encrypted PDFs with no User password - supported. Because Sitecore only opens PDFs for read-only access, it only needs to supply the User password (not the Owner password) when opening an encrypted file. If the User password is empty, the file can be indexed, even if an Owner password is set.

Unsupported PDFs

Sitecore cannot index the following types of PDF files:

Scenario	Error message
The PDF uses an encryption algorithm not supported by PDFSharp (for example, AES). PDFSharp only supports the legacy RC4 stream cipher.	`The PDF document is protected with an encryption not supported by PDFsharp`
The PDF uses RC4 encryption and requires a User password to open.	`To modify the document the owner password is required`
The PDF file is corrupted or malformed.	`Unexpected token 'endobj' in PDF stream. The file may be corrupted.`

Workarounds

If your PDF files cannot be indexed due to encryption limitations, consider the following options:

Remove or change the encryption - re-save the PDF without encryption, or with RC4 encryption and an empty User password, so that PDFSharp can open it for read access.
Use an alternative text extractor - configure Sitecore to use Apache Tika, Solr Cell, or IFilter instead of the default PDFSharp extractor. These alternatives may support a broader range of PDF encryption types.
