Walkthrough: Using Apache Tika to extract media content for indexing
How to use Apache Tika, a content analysis toolkit, to extract media content for indexing.
As an alternative to the other content extraction methods, you can use Apache Tika to extract media content for indexing.
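This walkthrough describes how to:
Set up Apache Tika
Enable Apache Tika as the primary media content extraction provider
Verify that indexing works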
Before you start, make sure that you have set up Solr and that it is running without problems.
To set up Apache Tika:
Download Apache Tika and save the tika-server-x.x.jar file to the folder you want to run Tika from.
Note
Sitecore has been developed and tested using Apache Tika version 1.22. We recommend that you use this version. When new versions of Apache Tika are released, older versions are still available in the archive.
In the folder where you saved the file, open a PowerShell prompt and start Apache Tika:
java -jar tika-server-x.x.jar --host=<Tikahostname> --port=<portnumber>
Note
If you do not specify host and port, Apache Tika uses the defaults of localhost and 9998.
To confirm that Apache Tika is running, browse to the Tika server URL, http://<Tikahostname>:<portnumber>. If the server is running, you can see a Welcome message.
Go to the Sitecore Admin page (https://<sitecoreinstance>/sitecore/admin/showconfig.aspx) and check that TikaMediaFileTextExtractor has been added to the <contentExtraction> node.
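The browser check that the Tika server is running can also be scripted. The following is a minimal sketch, not part of the Sitecore tooling. It assumes that the server answers a plain GET on its root URL with HTTP 200 (the Welcome page), and the function names tika_base_url and tika_is_running are hypothetical helpers introduced for illustration.

```python
import http.client
import urllib.error
import urllib.request


def tika_base_url(host="localhost", port=9998):
    """Build the Tika server URL; the defaults mirror Tika's own defaults."""
    return f"http://{host}:{port}"


def tika_is_running(host="localhost", port=9998, timeout=5.0):
    """Return True if a GET on the server root is answered with HTTP 200.

    Assumption: a running Tika server serves its Welcome page at the root URL.
    Any connection problem (refused, timeout, bad response) counts as "not running".
    """
    try:
        with urllib.request.urlopen(tika_base_url(host, port), timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False
```

For example, calling tika_is_running("<Tikahostname>", <portnumber>) with the values you passed on the command line should return True once the server has started.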
You can configure Apache Tika as your primary media content extraction provider.
To enable Apache Tika as the primary media content extraction provider:
Open the App_Config\ConnectionStrings.config file, and add this connection string:
<add name="tika" connectionString="<Tika server url>" />
Restart Sitecore.
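A quick way to double-check the connection string step is to parse the configuration file and look up the tika entry. The sketch below assumes the usual layout of a connection strings file (a connectionStrings root element containing add elements). The sample content and the helper name find_connection_string are hypothetical and only serve as an illustration.

```python
import xml.etree.ElementTree as ET


def find_connection_string(config_xml, name="tika"):
    """Return the connectionString value of the named entry, or None if it is absent."""
    root = ET.fromstring(config_xml)
    for entry in root.iter("add"):
        if entry.get("name") == name:
            return entry.get("connectionString")
    return None


# Hypothetical file content, for illustration only.
SAMPLE_CONFIG = """<connectionStrings>
  <add name="tika" connectionString="http://localhost:9998" />
</connectionStrings>"""

print(find_connection_string(SAMPLE_CONFIG))
```

To run the same check against a real installation, pass the contents of App_Config\ConnectionStrings.config instead of the sample string. A result of None suggests that the tika entry is missing or misnamed.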
After setting up and enabling Apache Tika, it is a good idea to verify that indexing works correctly.
To verify that indexing works:
On the Sitecore Launchpad, click Control Panel and rebuild indexes.
In the Content Editor, perform a simple search, for example for the Home item.
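If a search does not return the expected media content, one way to narrow down the problem is to send a file directly to the Tika server and see whether any text comes back. The sketch below uses the Tika server's /tika endpoint with a PUT request and an Accept: text/plain header. The default URL and the helper names build_extract_request and extract_text are assumptions made for this example, and it is not the mechanism Sitecore itself uses internally.

```python
import urllib.request


def build_extract_request(file_bytes, tika_url="http://localhost:9998"):
    """Build a PUT request to the Tika server's /tika endpoint, asking for plain text."""
    return urllib.request.Request(
        url=tika_url.rstrip("/") + "/tika",
        data=file_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path, tika_url="http://localhost:9998", timeout=30.0):
    """Send the file at `path` to the Tika server and return the extracted text."""
    with open(path, "rb") as handle:
        request = build_extract_request(handle.read(), tika_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling extract_text with the path to a local PDF or Word document should return the document text if the Tika server is reachable. An empty result or a connection error points at the Tika server rather than at the Sitecore index.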