Configure for PDFs

Sitecore Search can index PDFs and have them appear in search results. As when you configure a source to index HTML or JSON content, when you index PDF content, each PDF becomes an index document with attributes such as title, description, and url.

Sitecore Search document extractors can only parse HTML or JSON, so Search converts each PDF to HTML before extraction. To extract attribute values from PDF content effectively, you have to understand the HTML structure of your PDFs.

Unlike HTML pages in a browser, you can't directly inspect or view the source of a PDF. However, you can configure a temporary Search source to view the HTML structure of PDFs. This source is called temporary because its only purpose is to allow you to view the HTML structure of PDFs. You don't use the index documents from this source to create a search experience.

Important

To view the HTML structure of a PDF, use a Search source and not an external tool. External converters might give you HTML with syntax variations, which might lead to unexpected attribute values.

After you understand the HTML structure of your PDFs, you can configure a source to index PDF content.

Note

We recommend that you use an advanced web crawler source to index PDF content because you will usually need the flexibility that a JavaScript document extractor provides.

Document extractors and indexing PDFs

Just like when you index HTML content, you can use any type of document extractor to extract PDF content. However, you will usually use a JavaScript document extractor for PDFs because it supports complex use cases such as sanitizing text and applying conditional logic.

When you configure a document extractor for PDF content, keep the following considerations in mind:

  • To ensure the extraction rules you define apply only to PDFs and not other content like HTML, define URLs to Match. This ensures the crawler only applies rules to URLs that match a defined pattern.

    Usually, the GLOB expression **/*.pdf suffices because ** matches any number of path segments, so the pattern applies recursively.

  • To ensure that all PDFs are marked as such, we recommend that you set the type attribute to the fixed value of pdf.

  • To extract other attributes like title, description, url, and parent_url, use the HTML structure of your PDFs that you viewed through the temporary source.
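To sanity-check that a GLOB pattern like **/*.pdf matches the URLs you expect, you can approximate its behavior with a regular expression. The function and sample URLs below are an illustration only, not the matcher Search uses internally: ** spans any number of path segments, while * spans characters within a single segment.

```javascript
// Approximate the glob '**/*.pdf': '(?:[^/]*\/)*' consumes any number of
// path segments, '[^/]*\.pdf' matches the file name in the final segment.
function matchesPdfGlob(url) {
  return /^(?:[^/]*\/)*[^/]*\.pdf$/.test(new URL(url).pathname);
}

console.log(matchesPdfGlob('https://www.bank.com/legal/terms.pdf'));   // true
console.log(matchesPdfGlob('https://www.bank.com/a/b/report.pdf'));    // true
console.log(matchesPdfGlob('https://www.bank.com/legal/terms.html'));  // false
```

Note that the check runs against the URL path only, so a query string such as ?md=20230215T151008Z does not prevent a match.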

To help you, we've created some sample document extractors.
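For orientation, here is a minimal sketch of what a JavaScript document extractor for PDFs might look like. The selectors and the description logic are assumptions for illustration only; replace them with selectors that match the HTML structure you viewed through the temporary source.

```javascript
function extract(request, response) {
  const $ = response.body; // handle on the HTML conversion of the PDF

  return [
    {
      type: 'pdf', // fixed value so all PDFs are marked as such
      url: request.url,
      // Hypothetical selectors -- adapt to your PDFs' actual HTML structure.
      title: $('title').text().trim(),
      description: $('body').text().trim().slice(0, 300),
    },
  ];
}
```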

Crawler settings for indexing PDFs

Crawler settings define the scope and behavior of the crawler. In crawler settings, you define things like how many links to follow from a URL, whether to avoid certain URLs, and any header information to pass with each request.

When you index PDF documents, we recommend that you review at least the following settings:

  • Max Depth - if you want the crawler to find PDF URLs by following hyperlinks, ensure that the MAX DEPTH is at least 1. If the MAX DEPTH is 0, the crawler does not follow any hyperlinks, including links to PDFs.

    Note

    If your trigger is a sitemap or sitemap index that contains PDF URLs, leave the MAX DEPTH at the default of 0.

  • Allowed Domains - if you specify allowed domains and your PDFs are hosted on a different domain from your HTML pages, add the domain that hosts the PDFs. For example, if your HTML pages are on www.bank.com but your PDFs are on wwwbank.hostingservice.net, add wwwbank.hostingservice.net as an allowed domain.
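To see why MAX DEPTH matters, consider a toy breadth-first crawl over a made-up link graph. This is an illustration of the concept, not Search's internal crawler: with a depth of 0 only the starting URL is fetched, while a depth of 1 also reaches the PDFs linked from it.

```javascript
// Toy link graph: each URL maps to the hyperlinks found on that page.
const links = {
  'https://www.bank.com/legal/': [
    'https://wwwbank.hostingservice.net/terms.pdf',
    'https://wwwbank.hostingservice.net/fees.pdf',
  ],
};

// Breadth-first crawl that stops following hyperlinks beyond maxDepth.
function crawl(start, maxDepth) {
  const visited = new Set([start]);
  let frontier = [start];
  for (let depth = 0; depth < maxDepth; depth++) {
    frontier = frontier
      .flatMap((url) => links[url] || [])
      .filter((url) => !visited.has(url));
    frontier.forEach((url) => visited.add(url));
  }
  return [...visited];
}

console.log(crawl('https://www.bank.com/legal/', 0).length); // 1 -- the PDFs are never reached
console.log(crawl('https://www.bank.com/legal/', 1).length); // 3 -- the page plus both PDFs
```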

Request extractors and indexing PDFs

Usually, triggers provide the starting points for the crawler. However, if the trigger does not cover PDFs, you will need to use request extractors to create additional requests.

Note

For request extractors to work, set the MAX DEPTH to at least 1.

For example, if the trigger is a sitemap that contains all HTML and PDF URLs, you don't need to configure request extractors. However, if the trigger is a sitemap that contains only HTML pages, and PDF URLs are embedded within those pages, create a request extractor to generate PDF URLs for the crawler.

Here's a sample JavaScript request extractor that collects only the PDF URLs on a page:

function extract(request, response) {
  const $ = response.body;
  // Match hrefs that end in .pdf, optionally followed by a query string.
  const pdfRegex = /\.pdf(?:\?.*)?$/i;
  return $('a')
    .toArray()
    .map((a) => $(a).attr('href'))
    .filter((url) => url && pdfRegex.test(url))
    .map((url) => ({ url }));
}
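You can exercise the URL-matching part of this extractor outside the crawler. The snippet below applies the extractor's pattern to a few hrefs; the URLs are made up for illustration.

```javascript
// The same pattern as in the request extractor: a path ending in .pdf,
// optionally followed by a query string.
const pdfRegex = /\.pdf(?:\?.*)?$/i;

const hrefs = [
  'https://wwwbank.hostingservice.net/february-2023--global.pdf?md=20230215T151008Z',
  '/legal/terms.pdf',
  '/legal/terms.html',
];

const pdfOnly = hrefs.filter((url) => pdfRegex.test(url));
console.log(pdfOnly.length); // 2 -- only the two .pdf hrefs survive
```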

Attributes specific to indexing PDFs

Sitecore Search does not require any additional attributes to index PDFs. However, you can create custom attributes that improve how your search results are organized and presented.

For example, if you want to track the parent page that a PDF belongs to, you can optionally create and publish an attribute called parent_url, of type string. Then, configure the document extractor to extract a value for this attribute.

The parent_url attribute is useful when PDF URLs differ from the URL of the page that hosts them. For example, the page https://www.bank.com/legal/ might contain PDFs with URLs like https://wwwbank.hostingservice.net/february-2023--global.pdf?md=20230215T151008Z, which is in a different domain.

Use the following settings for the parent_url attribute:

  • Entity - Content, or any other entity you want.

  • Display Name - Parent URL, or similar.

  • Attribute Name - parent_url, or similar.

  • Placement - Standard.

  • Data Type - String.
