Index content of media files
The Sitecore Content Search API uses the following open-source libraries to extract text content from media files for indexing:
- PDFSharp for PDF files
- Open XML SDK for DOCX, XLSX, and PPTX files
Configure media indexing
The default text extractor supports only the following file formats: .pdf, .docx, .xlsx, and .pptx. If you need to index more file formats, consider using the IFilter text extractor.
When you use the IFilter text extractor, Sitecore uses the configuration inside the <mediaIndexing> section and tries to index the content of files with these extensions: rtf, odt, doc, dot, docx, dotx, docm, dotm, xls, xlt, xla, xlsx, xlsm, xltm, xlam, xlsb, ppt, pot, pps, ppa, pptx, potx, ppsx, ppam, pptm, potm, and ppsm, or with the MIME types application/pdf, text/html, and text/plain. Whether extraction succeeds depends on the IFilters installed on the system.
You cannot use the IFilter option if the application is deployed as an Azure Web App.
If you want to index a different set of file types, you can specify them by patching the mediaIndexing configuration node for the search provider you use. For Solr, the default configuration is in the App_Config\Sitecore\ContentSearch\Sitecore.ContentSearch.Solr.DefaultIndexConfiguration.config file.
There are two nodes under the mediaIndexing node:
- mimeTypes
  This node specifies the MIME types that are included (indexed) or excluded (not indexed).
- extensions
  This node specifies the file extensions of files that are included (indexed) or excluded (not indexed).
Each node has two child nodes, excludes and includes, which specify what not to index and what to index. You can use an asterisk (*) as a wildcard in both. Wildcards let you implement whitelisting and blacklisting. To whitelist, add the wildcard to the excludes node and then add the allowed extensions or MIME types to the includes node. To blacklist, add the wildcard to the includes node and then use the excludes node to list the extensions or MIME types that must not be indexed.
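The whitelist approach can be illustrated with a configuration fragment. This is a minimal sketch, not the shipped configuration: the child element names mimeType and extension are assumptions, the chosen values are only examples, and the enclosing index configuration path is not shown. Check the default configuration file for your search provider before patching.

```xml
<!-- Sketch of a whitelist: exclude everything, then include specific types. -->
<!-- Assumption: entries are written as <mimeType> and <extension> elements. -->
<mediaIndexing>
  <mimeTypes>
    <includes>
      <!-- Only PDF content is indexed by MIME type. -->
      <mimeType>application/pdf</mimeType>
    </includes>
    <excludes>
      <!-- The wildcard excludes every MIME type that is not explicitly included. -->
      <mimeType>*</mimeType>
    </excludes>
  </mimeTypes>
  <extensions>
    <includes>
      <!-- Only these file extensions are indexed. -->
      <extension>docx</extension>
      <extension>xlsx</extension>
    </includes>
    <excludes>
      <!-- The wildcard excludes every extension that is not explicitly included. -->
      <extension>*</extension>
    </excludes>
  </extensions>
</mediaIndexing>
```

A blacklist reverses the pattern: the wildcard goes into includes, and the specific entries go into excludes.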
Configure content search to use IFilter for indexing
You can set up the Sitecore Content Search API to use the native Microsoft Windows IFilter interface.
A PDF IFilter is not installed by default. To index the content of PDF files, identify each Sitecore instance that performs indexing. When you use the Solr or Azure Search provider, this is usually a CM instance. You must install a PDF IFilter on each machine that hosts such an instance.
To enable IFilter to extract media file content for indexing:
- Modify (patch) App_Config\Sitecore\ContentSearch\Sitecore.ContentSearch.config and change the value of the implementationType attribute like this:
  <services>
    <register serviceType="Sitecore.ContentSearch.ContentExtraction.IMediaFileTextExtractor, Sitecore.ContentSearch.ContentExtraction" implementationType="Sitecore.ContentSearch.Extractors.IFilterMediaExtractor, Sitecore.ContentSearch" />
  </services>
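If you prefer not to edit the shipped file directly, the same change might be expressed as a separate patch file. The following is a minimal sketch under stated assumptions: it assumes the register element sits under sitecore/services in your version, that the patcher matches it by its serviceType attribute, and that the file is placed in a folder your instance loads patch files from. Verify the result on the Show Config page before relying on it.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Hypothetical patch file; the file name and location are up to you. -->
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <services>
      <!-- Assumption: the existing registration is matched by its serviceType attribute. -->
      <register serviceType="Sitecore.ContentSearch.ContentExtraction.IMediaFileTextExtractor, Sitecore.ContentSearch.ContentExtraction">
        <!-- Replace the default extractor with the IFilter-based extractor. -->
        <patch:attribute name="implementationType">Sitecore.ContentSearch.Extractors.IFilterMediaExtractor, Sitecore.ContentSearch</patch:attribute>
      </register>
    </services>
  </sitecore>
</configuration>
```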