Configure indexing

Index content of media files

The Sitecore Content Search API uses the following open-source libraries to extract text content from media files for indexing:

PDFSharp for PDF files
Open XML SDK for DOCX, XLSX and PPTX files

Configure media indexing

The default text extractor supports only the following file formats: .pdf, .docx, .xlsx, and .pptx. If you need to index additional formats, consider using the IFilter text extractor.

When you use the IFilter text extractor, Sitecore uses the configuration in the <mediaIndexing> section and tries to index content from files with these extensions: rtf, odt, doc, dot, docx, dotx, docm, dotm, xls, xlt, xla, xlsx, xlsm, xltm, xlam, xlsb, ppt, pot, pps, ppa, pptx, potx, ppsx, ppam, pptm, potm, and ppsm, or with these MIME types: application/pdf, text/html, and text/plain. Whether extraction succeeds depends on the IFilters installed on the system.

You cannot use the IFilter option if the application is deployed as an Azure Web App.

If you want to index a different set of file types, you can specify the file types by patching the mediaIndexing configuration node for the search provider you use. For Solr, the default configuration is in the App_Config\Sitecore\ContentSearch\Sitecore.ContentSearch.Solr.DefaultIndexConfiguration.config file.

There are two nodes under the mediaIndexing node:

mimeTypes

This node specifies the MIME types that are included (indexed) or excluded (not indexed).

extensions

This node specifies the file extensions of files that are included (indexed) or excluded (not indexed).

Each of these nodes has two child nodes: excludes, which specifies what is not indexed, and includes, which specifies what is indexed. You can use an asterisk (*) as a wildcard in both child nodes, which lets you implement whitelisting and blacklisting. To whitelist, add the wildcard to the excludes node and then add the permitted extensions or MIME types to the includes node. To blacklist, add the wildcard to the includes node and then add the specific extensions or MIME types you want to skip to the excludes node.
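
As an illustration of the whitelist approach described above, a mediaIndexing node might look roughly like the following sketch. The child element names mimeType and extension are assumptions based on typical default configurations, and the listed values are only examples, so check them against the configuration file for your search provider before patching.

```xml
<!-- Sketch only: whitelist configuration for media indexing. -->
<!-- The mimeType and extension element names are assumed; verify them in your own config. -->
<mediaIndexing>
  <mimeTypes>
    <!-- Exclude every MIME type by default -->
    <excludes>
      <mimeType>*</mimeType>
    </excludes>
    <!-- Then allow only the MIME types listed here -->
    <includes>
      <mimeType>application/pdf</mimeType>
      <mimeType>text/plain</mimeType>
    </includes>
  </mimeTypes>
  <extensions>
    <!-- Exclude every extension by default -->
    <excludes>
      <extension>*</extension>
    </excludes>
    <!-- Then allow only the extensions listed here -->
    <includes>
      <extension>docx</extension>
      <extension>xlsx</extension>
    </includes>
  </extensions>
</mediaIndexing>
```

A blacklist reverses the two child nodes: the wildcard goes in includes, and the specific entries to skip go in excludes.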

Configure content search to use IFilter for indexing

You can set up the Sitecore Content Search API to use the native Microsoft Windows IFilter interface.

A PDF IFilter is not installed by default. To index the content of PDF files, identify each Sitecore instance that performs indexing and install a PDF IFilter on every machine that hosts such an instance. When you use the Solr or Azure Search provider, this is usually a CM instance.

To enable IFilter to extract media file content for indexing:

Modify (patch) the App_Config\Sitecore\ContentSearch\Sitecore.ContentSearch.config file and change the value of the implementationType attribute like this:

<services>  
    <register serviceType="Sitecore.ContentSearch.ContentExtraction.IMediaFileTextExtractor, Sitecore.ContentSearch.ContentExtraction" implementationType="Sitecore.ContentSearch.Extractors.IFilterMediaExtractor, Sitecore.ContentSearch" />
</services>
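
If you prefer not to edit the default file directly, the same change could be expressed as a separate patch file. The following is a minimal sketch, assuming the register element sits under sitecore/services and that the patch file is placed in a folder that loads after the default configuration; the file name and location are up to you and are not specified by this article.

```xml
<!-- Sketch only: patch file that switches the media text extractor to IFilter. -->
<!-- Assumes the register element is located under sitecore/services. -->
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <services>
      <!-- Match the existing registration by its serviceType attribute -->
      <register serviceType="Sitecore.ContentSearch.ContentExtraction.IMediaFileTextExtractor, Sitecore.ContentSearch.ContentExtraction">
        <!-- Replace only the implementationType attribute value -->
        <patch:attribute name="implementationType">Sitecore.ContentSearch.Extractors.IFilterMediaExtractor, Sitecore.ContentSearch</patch:attribute>
      </register>
    </services>
  </sitecore>
</configuration>
```

After applying a patch like this, you can confirm the resulting value of implementationType by inspecting the merged configuration of the instance.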