Creating an index document by using an existing document extractor
You can use the Ingestion API to add index documents to a pull source by using an existing document extractor. When you do this, you don't need to define any new extraction logic.
You can also create an index document by using a new document extractor or by passing attribute values.
If you use the Ingestion API to add index documents to a pull source like the advanced web crawler, you can create a new document extractor or use an existing one.
If you use the Ingestion API to add index documents to the API push source, you must provide the extractor logic inline as part of the POST request.
Depending on your use case, you can use an existing document extractor to create an index document from a file or from a URL:
- To create an index document from a file, make a POST call to {base-URL}/ingestion/v1/domains/{domainID}/sources/{sourceID}/entities/{entityID}/file/{documentID}?locale={locale}. Use this endpoint when you want to create an index document from a file on the machine from which you're making the call. In the request, you pass a path to the file you want indexed, like /Users/Jane/Documents/Summer/summer_activites.pdf.
  Important: Search cannot access files directly from your local file system. You must use multipart form data to upload the file as part of the request.
- To create an index document from a URL, make a POST call to {base-URL}/ingestion/v1/domains/{domainID}/sources/{sourceID}/entities/{entityID}/url/{documentID}?locale={locale}. In the request, you pass a URL you want indexed, like https://archive.doc.sitecore.com/xp/en/legacy-docs/web-forms-for-marketers-8.0.pdf. Sitecore Search will access the URL, extract attribute values, and create an index document representing the URL.
  Important: The URL must point to a file with a supported content type, for example, .pdf or .docx. This method cannot be used to index URLs that return HTML content.
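The two endpoints differ only in the path segment before the document ID (file or url). As a minimal sketch, the helper below assembles either endpoint from its parts. The function name build_ingestion_url, its parameters, and the sample values are hypothetical and only illustrate the URL pattern described above; they are not part of the Ingestion API.

```python
from urllib.parse import quote, urlencode


def build_ingestion_url(base_url, domain_id, source_id, entity_id,
                        document_id, locale, kind):
    """Assemble the endpoint for creating an index document.

    kind is "file" or "url", matching the two endpoints described above.
    All names here are illustrative assumptions, not an official client.
    """
    if kind not in ("file", "url"):
        raise ValueError("kind must be 'file' or 'url'")

    # Percent-encode each path segment so special characters in an ID
    # cannot change the structure of the path.
    segments = [
        "ingestion", "v1",
        "domains", quote(str(domain_id), safe=""),
        "sources", quote(str(source_id), safe=""),
        "entities", quote(str(entity_id), safe=""),
        kind, quote(str(document_id), safe=""),
    ]
    query = urlencode({"locale": locale})
    return base_url.rstrip("/") + "/" + "/".join(segments) + "?" + query


# Placeholder values only; substitute your own base URL and IDs.
endpoint = build_ingestion_url(
    "https://example.invalid", "123", "456", "content", "doc-1", "en_us", "url"
)
```

Keeping the file/url choice as a single argument makes it harder to mix up the two endpoints when the same code handles both uploads and remote documents.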
In this example, you create an index document from the URL https://archive.doc.sitecore.com/xp/en/legacy-docs/web-forms-for-marketers-8.0.pdf. The type and name attributes have fixed values, but all other attributes use the logic defined by a document extractor called Docs JavaScript. To do this, make a POST call to {base-URL}/ingestion/v1/domains/{domainID}/sources/{sourceID}/entities/{entityID}/url/{documentID}?locale={locale}.
Here's a sample POST cURL call:
curl --location '{{base-url}}/ingestion/v1/domains/{{domain}}/sources/{{source}}/entities/{{entity}}/url/{{documentId}}?locale={{locale}}' \
--form 'url="https://archive.doc.sitecore.com/xp/en/legacy-docs/web-forms-for-marketers-8.0.pdf"' \
--form 'fields="{\"items\": {\"type\":\"docs\",\"name\":\"Web Forms for Marketers\"}}"' \
--form 'extractorName="Docs JavaScript"'
If your request is successful, you get a 200 response with this body:
{
  "enqueued": true
}
In the response, enqueued means that Search has added this document to its indexing queue. After a few minutes, you'll see the new document in the Content Collection section of Search.
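If you make this call from a script rather than from cURL, the same three form fields and the enqueued check could be expressed as follows. This is a rough sketch using only the Python standard library. The helper names build_form_fields and is_enqueued are hypothetical; the field names url, fields, and extractorName and the response body are taken from the sample above. Sending the request, and any authentication your environment requires, is deliberately left out.

```python
import json


def build_form_fields(document_url, fixed_attributes, extractor_name):
    """Return the three multipart form fields used in the sample cURL call."""
    return {
        "url": document_url,
        # Fixed attribute values are passed as a JSON string under "items".
        "fields": json.dumps({"items": fixed_attributes}),
        # Name of the existing document extractor that handles the
        # remaining attributes.
        "extractorName": extractor_name,
    }


def is_enqueued(response_body):
    """Return True only when the JSON body reports "enqueued": true."""
    try:
        payload = json.loads(response_body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("enqueued") is True


form = build_form_fields(
    "https://archive.doc.sitecore.com/xp/en/legacy-docs/web-forms-for-marketers-8.0.pdf",
    {"type": "docs", "name": "Web Forms for Marketers"},
    "Docs JavaScript",
)
```

Because enqueued only confirms that the document was queued, a script that needs the document to be searchable should still allow a few minutes before it looks for the document in the index.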