Index content of media files

Abstract

Use IFilter to index PDF files in the media library.

The Sitecore Content Search API uses the following open-source libraries to extract text content from media files for indexing:

Note

You can also use Apache Tika to extract media content for indexing.

Configure media indexing

By default, Sitecore indexes media file types with these extensions: rtf, odt, doc, dot, docx, dotx, docm, dotm, xls, xlt, xla, xlsx, xlsm, xltm, xlam, xlsb, ppt, pot, pps, ppa, pptx, potx, ppsx, ppam, pptm, potm, and ppsm, and MIME types application/pdf, text/html, and text/plain.

If you want to index a different set of file types, you can specify the file types by patching the mediaIndexing configuration node for the search provider you use. For Solr, the default configuration is in the App_Config\Sitecore\ContentSearch\Sitecore.ContentSearch.Solr.DefaultIndexConfiguration.config file.

There are two nodes under the mediaIndexing node:

  • mimeTypes

    This node specifies the MIME types that are included (indexed) or excluded (not indexed).

  • extensions

    This node specifies the file extensions of files that are included (indexed) or excluded (not indexed).

Each node has two child nodes, excludes and includes, which specify what not to index and what to index. You can use an asterisk (*) as a wildcard in either node, and this lets you implement whitelisting and blacklisting. To whitelist, add the wildcard to the excludes node and list the allowed extensions or MIME types in the includes node. To blacklist, add the wildcard to the includes node and list the blocked extensions or MIME types in the excludes node.
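As an illustration of the whitelist pattern, the following fragment sketches how the extensions node might look when only docx and xlsx files are indexed. The extension child element name and the two chosen extensions are assumptions made for this example. Check the default index configuration file for your search provider for the exact element names before you write a patch.

```xml
<!-- Sketch only: the <extension> child element name is an assumption. -->
<mediaIndexing>
  <extensions>
    <!-- The wildcard in excludes means nothing is indexed unless it is listed in includes. -->
    <excludes>
      <extension>*</extension>
    </excludes>
    <!-- Whitelisted file extensions (example values). -->
    <includes>
      <extension>docx</extension>
      <extension>xlsx</extension>
    </includes>
  </extensions>
</mediaIndexing>
```

A blacklist is the mirror image: the wildcard goes in the includes node and the specific entries go in the excludes node. The mimeTypes node follows the same includes and excludes layout.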

Configure content search to use IFilter for indexing

You can set up the Sitecore Content Search API to use the native Microsoft Windows IFilter interface.

A PDF IFilter is not installed by default. To index the content of PDF files, identify each Sitecore instance that performs indexing. When you use the Solr or Azure Search provider, this is usually a CM instance. You must install a PDF IFilter on each machine that hosts such a Sitecore instance.

To enable IFilter to extract media file content for indexing:

  • Modify (patch) App_Config\Sitecore\ContentSearch\Sitecore.ContentSearch.config and change the value of the implementationType attribute like this:

    <services>
      <register serviceType="Sitecore.ContentSearch.ContentExtraction.IMediaFileTextExtractor, Sitecore.ContentSearch.ContentExtraction" implementationType="Sitecore.ContentSearch.Extracters.IFilterTextExtraction.IFilterMediaExtractor, Sitecore.ContentSearch" />
    </services>
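If you prefer a patch file to editing the default file directly, the change above could be expressed roughly as follows. This is a minimal sketch, not a verified configuration: it assumes that the services node sits directly under the sitecore node in your version, and that the register element can be matched by its serviceType attribute alone. Confirm both against your own Sitecore.ContentSearch.config.

```xml
<!-- Sketch of a patch file. The position of <services> under <sitecore> is an assumption. -->
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <services>
      <!-- Matches the existing registration by serviceType and replaces only implementationType. -->
      <register serviceType="Sitecore.ContentSearch.ContentExtraction.IMediaFileTextExtractor, Sitecore.ContentSearch.ContentExtraction">
        <patch:attribute name="implementationType">Sitecore.ContentSearch.Extracters.IFilterTextExtraction.IFilterMediaExtractor, Sitecore.ContentSearch</patch:attribute>
      </register>
    </services>
  </sitecore>
</configuration>
```

Keeping the change in a separate patch file leaves the default configuration file untouched, which tends to make upgrades easier to review.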