Index content of media files

Abstract

Use IFilter to index PDF files in the media library.

The Sitecore Content Search API uses the following open-source libraries to extract text content from media files for indexing:

Note

You can also use Apache Tika to extract media content for indexing.

Configure media indexing

The default text extractor supports only the following file formats: .pdf, .docx, .xlsx, and .pptx. If you need to index other formats, consider using one of the alternative text extractors: Solr Cell, Tika, or IFilter.

When you use an alternative text extractor, Sitecore uses the configuration in the <mediaIndexing> section and tries to index the content of files with these extensions: rtf, odt, doc, dot, docx, dotx, docm, dotm, xls, xlt, xla, xlsx, xlsm, xltm, xlam, xlsb, ppt, pot, pps, ppa, pptx, potx, ppsx, ppam, pptm, potm, and ppsm, or with these MIME types: application/pdf, text/html, and text/plain. Whether extraction succeeds depends on the IFilters installed on the system or on the file formats supported by Solr Cell or Tika.

You cannot use the IFilter option if the application is deployed as an Azure Web App.

If you want to index a different set of file types, you can specify the file types by patching the mediaIndexing configuration node for the search provider you use. For Solr, the default configuration is in the App_Config\Sitecore\ContentSearch\Sitecore.ContentSearch.Solr.DefaultIndexConfiguration.config file.

There are two nodes under the mediaIndexing node:

• mimeTypes

This node specifies the MIME types that are included (indexed) or excluded (not indexed).

• extensions

This node specifies the file extensions of files that are included (indexed) or excluded (not indexed).

Each of these nodes has two child nodes, excludes and includes, which specify what not to index and what to index. You can use an asterisk (*) as a wildcard in both child nodes, and you can implement whitelisting and blacklisting with it. To whitelist, add the wildcard to the excludes node, and then add the allowed extensions or MIME types to the includes node. To blacklist, add the wildcard to the includes node, and then add the extensions or MIME types you want to skip to the excludes node.
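As an illustration of the whitelist approach, the following fragment is a minimal sketch of what a mediaIndexing node could look like. The mimeTypes, extensions, includes, and excludes nodes come from the description above. The child element names mimeType and extension are assumptions, so check them against the default configuration file of your search provider. The enclosing patch file structure is deliberately omitted.

```xml
<!-- Sketch only: the <mimeType> and <extension> element names are assumptions. -->
<mediaIndexing>
  <mimeTypes>
    <!-- Whitelist: exclude every MIME type with the wildcard... -->
    <excludes>
      <mimeType>*</mimeType>
    </excludes>
    <!-- ...and then list the MIME types that should be indexed. -->
    <includes>
      <mimeType>application/pdf</mimeType>
    </includes>
  </mimeTypes>
  <extensions>
    <!-- Whitelist: exclude every extension with the wildcard... -->
    <excludes>
      <extension>*</extension>
    </excludes>
    <!-- ...and then list the extensions that should be indexed. -->
    <includes>
      <extension>docx</extension>
      <extension>xlsx</extension>
    </includes>
  </extensions>
</mediaIndexing>
```

For a blacklist, the wildcard would move to the includes node and the specific entries to skip would go in the excludes node.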

Configure content search to use IFilter for indexing

You can set up the Sitecore Content Search API to use the native Microsoft Windows IFilter interface.

A PDF IFilter is not installed by default. To index the content of PDF files, identify each Sitecore instance that performs indexing. When you use the Solr or Azure Search provider, this is usually a CM instance. You must install a PDF IFilter on each machine that hosts such an instance.

To enable IFilter to extract media file content for indexing:

• Modify (patch) App_Config\Sitecore\ContentSearch\Sitecore.ContentSearch.ContentExtraction.config as follows:

<mediaFileTextExtractor type="Sitecore.ContentSearch.ContentExtraction.IFilter.IFilterMediaFileTextExtractor, Sitecore.ContentSearch.ContentExtraction"/>