Walkthrough: Using Apache Tika to extract media content for indexing

Abstract

How to use Apache Tika, a content analysis toolkit, to extract media content for indexing.

You can use Apache Tika to extract the content of media files for indexing, as an alternative to other content extraction methods.

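This walkthrough describes how to:

  * Set up Apache Tika

  * Enable Apache Tika as the primary media content extraction provider

  * Verify that indexing works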

Before you begin, make sure that you have set up Solr and that it is running without errors.

To set up Apache Tika:

  1. Download Apache Tika and save the tika-server-x.x.jar file to the folder you want to run Tika from.

    Note

    Sitecore has been developed and tested using Apache Tika version 1.22. We recommend that you use this version. When new versions of Apache Tika are released, older versions are still available in the archive.

  2. In the folder where you saved the file, open a PowerShell prompt and start Apache Tika:

    java -jar tika-server-x.x.jar --host=<Tikahostname> --port=<portnumber>

    Note

    If you do not specify host and port, Apache Tika uses the defaults of localhost and 9998.

  3. To confirm that Apache Tika is running, browse to the Tika server URL, http://<Tikahostname>:<portnumber>. If the server is running, you can see a Welcome message.

  4. Go to the Sitecore Admin page (https://<sitecoreinstance>/sitecore/admin/showconfig.aspx) and check that TikaMediaFileTextExtractor has been added to the <contentExtraction> node:

    [Code sample showing TikaMediaFileTextExtractor in the <contentExtraction> node]
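The check in step 3 can also be scripted. The following is a minimal sketch, not part of the Sitecore tooling: it assumes the Tika server's `/version` resource is reachable over plain HTTP, and the function names `tika_base_url` and `tika_version` are hypothetical helpers introduced here for illustration.

```python
import urllib.error
import urllib.request


def tika_base_url(host="localhost", port=9998):
    """Build the Tika server URL; the defaults mirror Tika's own defaults."""
    return f"http://{host}:{port}"


def tika_version(base_url, timeout=5.0):
    """Return the version string reported by the server, or None if unreachable."""
    try:
        # GET <base_url>/version returns a short plain-text version string.
        with urllib.request.urlopen(f"{base_url}/version", timeout=timeout) as resp:
            return resp.read().decode("utf-8").strip()
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return None
```

Calling `tika_version(tika_base_url())` returns `None` when nothing answers on the default host and port, which suggests the server did not start or is listening elsewhere.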

You can configure Apache Tika as your primary media content extraction provider.

To enable Apache Tika as the primary media content extraction provider:

  1. Open the App_Config\ConnectionStrings.config file, and add this connection string:

    <add name="tika" connectionString="<Tika server URL>" />

  2. Restart Sitecore.
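Before restarting, it can help to confirm that the connection string was saved as intended. The sketch below is one possible check, assuming the file has the usual `<connectionStrings>` root with `<add>` entries. The sample content and the URL `http://localhost:9998` are example values, and `find_connection_string` is a hypothetical helper, not a Sitecore API.

```python
import xml.etree.ElementTree as ET


def find_connection_string(config_xml, name="tika"):
    """Return the connectionString of the <add> entry called `name`, or None."""
    root = ET.fromstring(config_xml)
    for entry in root.iter("add"):
        if entry.get("name") == name:
            return entry.get("connectionString")
    return None


# Hypothetical file content; the URL is an example, not a required value.
SAMPLE_CONFIG = """<connectionStrings>
  <add name="tika" connectionString="http://localhost:9998" />
</connectionStrings>"""

print(find_connection_string(SAMPLE_CONFIG))  # prints http://localhost:9998
```

To check a real file, read its text and pass it to the same function. A result of `None` means that no entry named `tika` was found.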

After setting up and enabling Apache Tika, it is a good idea to verify that indexing works correctly.

To verify that indexing works:

  1. On the Sitecore Launchpad, click Control Panel and rebuild indexes.

  2. In the Content Editor, perform a simple search, for example for the Home item.
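The same check can be made against Solr directly. This is a rough sketch under stated assumptions: it uses Solr's standard `/select` handler, the Solr URL and the core name `sitecore_master_index` are placeholders for your own values, and `solr_select_url` and `count_matches` are hypothetical helpers introduced for illustration.

```python
import json
import urllib.parse
import urllib.request


def solr_select_url(solr_url, core, query, rows=5):
    """Build a URL for Solr's standard /select handler."""
    params = urllib.parse.urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{solr_url.rstrip('/')}/{core}/select?{params}"


def count_matches(solr_url, core, query, timeout=10.0):
    """Return how many indexed documents match `query` (Solr's numFound)."""
    # rows=0 asks only for the match count, not the documents themselves.
    url = solr_select_url(solr_url, core, query, rows=0)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)["response"]["numFound"]
```

For example, `count_matches("https://localhost:8983/solr", "sitecore_master_index", "Home")` should return a number greater than zero after a successful rebuild, assuming that URL and core name match your installation. Searching for a word that occurs only inside a media file is a more direct test that extraction worked.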