Creating an index document by passing a new document extractor

When you use the Ingestion API to add index documents to a pull or push source, you can create a temporary document extractor to extract attributes.

The document extractor is temporary because it is passed as part of an Ingestion API request, but Search does not store it. This means if you want to use the same extractor to create multiple index documents, you'll need to pass the whole extractor with each call.

Depending on your use case, you can use a new document extractor to create an index document from a file or from a URL:

  • To create an index document from a file, make a POST call to {base-URL}/ingestion/v1/domains/{domainID}/sources/{sourceID}/entities/{entityID}/file/{documentID}?locale={locale}. Use this endpoint when you want to create an index document from a file on the machine from which you're making the call. In the request, you pass a path to the file you want indexed, like /Users/Jane/Documents/Summer/summer_activites.pdf.

  • To create an index document from a URL, make a POST call to {base-URL}/ingestion/v1/domains/{domainID}/sources/{sourceID}/entities/{entityID}/url/{documentID}?locale={locale}. In the request, you pass a URL you want indexed, like https://archive.doc.sitecore.com/xp/en/legacy-docs/web-forms-for-marketers-8.0.pdf. Sitecore Search will access the URL, extract attribute values, and create an index document representing the URL.

Note

If you use the Ingestion API to add index documents to a pull source like the advanced web crawler, you can create a new document extractor or use an existing one.

If you use the Ingestion API to add index documents to the API push source, you'll have to create a new document extractor. This is because the API push source in Search only creates a placeholder index; no document extractor configuration is possible for this source type.

View the Swagger reference to see the detailed data model, and the document extractor reference for details on how to pass extractor logic.

Document extractors and the Ingestion API

A document extractor is a set of rules that define how to extract attribute values from a URL or file. Depending on the type of original content, you can use XPath, JavaScript, JSONPath, or CSS to create rules.

When you use the Ingestion API to create a new document extractor, you pass attribute extraction logic in the request as a string with a specific format.
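To illustrate the "string with a specific format" requirement, here is a minimal Python sketch (hypothetical helper code, not part of the API) that serializes a JavaScript extractor into the JSON string you would place in the request. The `type` and `source` keys mirror the cURL samples later in this article.

```python
import json

# The extractor travels as a JSON string: "type" names the rule
# language ("js" here) and "source" holds the extraction function
# as a single string. Building it with json.dumps avoids manual
# quote escaping.
extractor_source = (
    "function extract(request, response) { "
    "$ = response.body; "
    "title = $('title').text(); "
    "return [{ 'title': title }]; }"
)

extractor_field = json.dumps({"type": "js", "source": extractor_source})
print(extractor_field)
```

Serializing with `json.dumps` rather than hand-writing the escaped string is especially helpful once the JavaScript source itself contains quotes or backslashes.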

Tip

Administrators use the Sitecore Search interface to configure document extractors, including extractors for localized content. Some UI-based steps involve adding code that is similar to the code you'll be adding to the body of an Ingestion API request.

We recommend that you read the documentation to understand the concept of document extractors and to view code samples.

Example 1: Create an index document from a file

In this example, you create an index document from a file on your machine, available at /Users/Jane/Documents/Summer/summer_activites.pdf, using a JavaScript document extractor. The type and product attributes will have fixed values, but for all other attributes you want to define a document extractor that uses JavaScript to extract attribute values. To do this, make a POST call to {base-URL}/ingestion/v1/domains/{domainID}/sources/{sourceID}/entities/{entityID}/file/{documentID}?locale={locale}.

Here's a sample POST cURL call:

curl --location '{base-url}/ingestion/v1/domains/{{domain}}/sources/{{source}}/entities/{{entity}}/file/{{documentId}}?locale={{locale}}' \
--form 'file=@"/Users/Jane/Documents/Summer/summer_activites.pdf"' \
--form 'fields="{\"items\": {\"type\":\"pdf\",\"product\":\"kids\"}}"' \
--form 'extractor="{\"type\": \"js\", \"source\": \"function extract(request, response) { $ = response.body; title = $('\''title'\'').text(); description = $('\''body'\'').text().replace(/(\\\\r\\\\n|\\\\n|\\\\r)/gm, '\'' '\'').replace(/ +/g, '\'' '\'').replace(/\\\\.+/g, '\''.'\'').substring(0, 7000); return [{ '\''title'\'': title, '\''description'\'': description }]; }\"}"'
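The shell escaping in the extractor form field is easy to get wrong. As a hedged alternative sketch, the same three multipart fields could be assembled in Python; the attribute names and extractor logic mirror the cURL sample above, while the endpoint and use of the `requests` library are assumptions for illustration.

```python
import json

# Build the "fields" and "extractor" multipart values from Example 1
# as JSON strings, letting json.dumps handle all quote escaping.
fields = json.dumps({"items": {"type": "pdf", "product": "kids"}})

extractor = json.dumps({
    "type": "js",
    "source": (
        "function extract(request, response) { "
        "$ = response.body; "
        "title = $('title').text(); "
        "description = $('body').text()"
        ".replace(/(\\r\\n|\\n|\\r)/gm, ' ')"
        ".replace(/ +/g, ' ')"
        ".substring(0, 7000); "
        "return [{ 'title': title, 'description': description }]; }"
    ),
})

# With the requests library, the call could then look like
# (endpoint placeholders assumed, not sent here):
# requests.post(url,
#               files={"file": open(path, "rb")},
#               data={"fields": fields, "extractor": extractor})
```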

If your request is successful, you get a 200 response with this body:

{
    "enqueued": true
}

In the response, enqueued means that Search has added this document to its indexing queue. After a few minutes, you'll see the new document in the Content Collection section of Sitecore Search.

Example 2: Create an index document from a URL

In this example, you create an index document from a URL, www.mybank.com/mortagages/fixed, using an XPath document extractor. The type and name attributes will have fixed values, but you want to define a document extractor that uses XPath expressions to extract attribute values for title and image_url. To do this, make a POST call to {base-URL}/ingestion/v1/domains/{domainID}/sources/{sourceID}/entities/{entityID}/url/{documentID}?locale={locale}.

Here's a sample POST cURL call:

curl --location '{base-url}/ingestion/v1/domains/{{domain}}/sources/{{source}}/entities/{{entity}}/url/{{documentId}}?locale={{locale}}' \
--form 'url="www.mybank.com/mortagages/fixed"' \
--form 'fields="{\"items\": {\"type\":\"Mortgages\",\"name\":\"Fixed mortgages\"}}"' \
--form 'extractor={"type":"xpath","rules":{"title":{"selectors":[{"expression":"//meta[@property='\''dozen:page:id'\'']/@content"}]},"image_url":{"selectors":[{"expression":"//meta[@property='\''og:image'\'']/@content"}]}}}'
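The XPath variant sends a `rules` object instead of a `source` string. As a sketch, the extractor body from Example 2 could be built as a Python dict and serialized for the `extractor` form field; the metadata property names mirror the cURL sample above.

```python
import json

# Each attribute maps to a list of selectors; each selector carries
# one XPath expression. Single quotes inside XPath need no escaping
# in JSON, only in the shell form of the request.
rules = {
    "title": {"selectors": [
        {"expression": "//meta[@property='dozen:page:id']/@content"}
    ]},
    "image_url": {"selectors": [
        {"expression": "//meta[@property='og:image']/@content"}
    ]},
}

extractor = json.dumps({"type": "xpath", "rules": rules})
print(extractor)
```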

If your request is successful, you get a 200 response with this body:

{
    "enqueued": true
}

In the response, enqueued means that Search has added this document to its indexing queue. After a few minutes, you'll see this document in the Content Collection section of Search.
