Ingestion API document extractor reference
When you use the Ingestion API to create index documents by passing attribute extraction logic, it's crucial to format the fields and extractor parameters accurately. Although these parameters are of type string, Sitecore Search expects them to be in a certain format so it can create accurate extraction rules.
This topic outlines how to format these strings properly, complementing the Swagger reference.
We recommend that you familiarize yourself with what a document extractor is, and how to create XPath or JavaScript document extractors.
Because you'll be using the Ingestion API, you don't need to follow the steps included in those topics, but it is helpful to understand the concepts and logic they describe.
The fields parameter
Use fields to override the rules of an existing document extractor and assign a fixed value to an attribute. You'll need to pass the attribute name and attribute value.
You don't have to pass fields when you pass a new extractor because you can provide fixed values for an attribute within the extractor parameter itself, as shown in Passing an XPath document extractor and Passing a JavaScript document extractor.
Use this format to define fixed values for attributes through the fields parameter:
{
"items": {
"<attribute1>": "<value of attribute 1>",
"<attribute2>": "<value of attribute 2>"
}
}
For example, to set a fixed value for the type and name attributes, you can pass the following in your cURL call:
--form 'fields={
"items": {
"type": "pdf",
"name": "Compliance Services Playbook"
}
}'
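Hand-writing JSON inside a shell-quoted form field is easy to get wrong. As a minimal sketch, assuming you assemble the request in Node.js rather than on the command line, the fields string can be generated with JSON.stringify. The helper name buildFieldsParam is hypothetical and is not part of the Ingestion API.

```javascript
// Hypothetical helper: wraps fixed attribute values in the
// { "items": { ... } } shape that the fields parameter expects.
function buildFieldsParam(fixedValues) {
  return JSON.stringify({ items: fixedValues });
}

const fields = buildFieldsParam({
  type: "pdf",
  name: "Compliance Services Playbook",
});

// The resulting string is what you would send as the fields form value.
console.log(fields);
```

Generating the string this way keeps the quoting and commas valid even as the list of attributes grows.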
The extractor parameter
Use the extractor parameter to pass the extraction logic you want to use.
Passing an XPath document extractor with multiple rules for an attribute
To define an XPath document extractor, you'll need to specify the extractor type (xpath) and the rules you want to use to extract values for attributes.
Use this format to pass XPath extraction logic via the extractor parameter:
--form 'extractor=
{
"type": "xpath",
"rules": {
"<attribute1>": {
"selectors": [
{
"expression": "<First XPath expression to get the value of attribute1>"
},
{
"expression": "<Second XPath expression to get the value of attribute1>"
}
]
},
"<attribute2>": {
"selectors": [
{
"expression": "<XPath expression to get the value of attribute2>"
}
]
},
"<attribute3>": {
"value": "<Fixed value for attribute3>"
}
}
}'
If you pass multiple selectors, Search applies them in the order they are passed. If the first selector doesn't result in an attribute value, Search tries the second, and so on until it gets an attribute value or there are no more selectors to try.
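The fallback order described above can be illustrated with a short sketch. This is not how Search is implemented; it only mimics the documented behavior. The function firstMatch and the evaluate callback are hypothetical names, and a lookup table stands in for a real XPath engine.

```javascript
// Illustrative only: the first selector that yields a value wins.
// `evaluate` is a hypothetical callback that runs one XPath expression and
// returns a string, or null / "" when nothing matches.
function firstMatch(selectors, evaluate) {
  for (const { expression } of selectors) {
    const value = evaluate(expression);
    if (value !== null && value !== undefined && value !== "") {
      return value;
    }
  }
  return null; // no selector produced a value
}

// Stand-in for a real XPath engine: a lookup table keyed by expression.
const fakePage = { "//title/text()": "Welcome" };
const lookup = (expression) => fakePage[expression] ?? null;

const title = firstMatch(
  [
    { expression: "//meta[@property='dozen:page:id']/@content" },
    { expression: "//title/text()" },
  ],
  lookup
);
console.log(title);
```

Here the first expression finds nothing in the stand-in page, so the second one supplies the value.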
For example, to use XPath to define extraction values for the title, image_url, and type attributes, you can pass the following in your cURL call:
--form 'extractor={
"type": "xpath",
"rules": {
"title": {
"selectors": [
{"expression": "//meta[@property='\''dozen:page:id'\'']/@content"},
{"expression": "//title/text()"}
]
},
"image_url": {
"selectors": [
{"expression": "//meta[@property='\''og:image'\'']/@content"}
]
},
"type": {
"value": "The Core"
}
}
}'
In that example, you use the following logic to get attribute values:
- title (first selector): get the value from a custom meta tag identified by property='dozen:page:id'.
- title (second selector): get the value from the HTML <title> tag.
- image_url: get the value from the og:image meta tag.
- type: assign the fixed value of The Core to all index documents.
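If you build the request in code, the same XPath extractor can be generated instead of typed by hand, which avoids escaping single quotes for the shell. The following is a sketch under that assumption. buildXPathExtractor is a hypothetical helper, and its convention (an array means selectors, a string means a fixed value) is an illustration choice, not something the Ingestion API defines.

```javascript
// Hypothetical helper: turns a compact description into the extractor JSON.
// An array means "XPath selectors, tried in order"; a string means "fixed value".
function buildXPathExtractor(attributes) {
  const rules = {};
  for (const [name, spec] of Object.entries(attributes)) {
    rules[name] = Array.isArray(spec)
      ? { selectors: spec.map((expression) => ({ expression })) }
      : { value: spec };
  }
  return JSON.stringify({ type: "xpath", rules });
}

const xpathExtractor = buildXPathExtractor({
  title: ["//meta[@property='dozen:page:id']/@content", "//title/text()"],
  image_url: ["//meta[@property='og:image']/@content"],
  type: "The Core",
});
console.log(xpathExtractor);
```

The output has the same structure as the extractor value in the cURL example: a type of xpath and one rule per attribute.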
Passing a JavaScript document extractor
To define a JavaScript document extractor, you'll need to specify the extractor type (js) and a JavaScript function, written using Cheerio syntax, that returns an array of objects.
Use this format to pass JavaScript extraction logic via the extractor parameter:
--form 'extractor={
"type": "js",
"source": "function extract(request, response) {
$ = response.body;
<attribute1> = <JavaScript code to get the value of attribute1>;
<attribute2> = <JavaScript code to get the value of attribute2>;
// Return structured data
return [{
'\''<attribute1>'\'': <attribute1>,
'\''<attribute2>'\'': <attribute2>,
'\''<attribute3>'\'': '\''fixed value'\'' // optional - set a fixed value for an attribute
}];
}"
}'
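Escaping a JavaScript function by hand for both JSON and the shell is tedious. One possible approach, sketched here on the assumption that you prepare the request in Node.js, is to write the function normally and let JSON.stringify serialize its source text. Whether Search accepts the newlines that this produces in the source string is an assumption you should verify. The attribute names in the function are examples only.

```javascript
// The extraction function is written as ordinary JavaScript. It is never
// called here; only its source text is placed in the extractor value.
function extract(request, response) {
  $ = response.body;
  return [{ title: $("title").text(), type: "PDF" }];
}

// Function.prototype.toString() returns the function's source text, and
// JSON.stringify escapes quotes and line breaks inside it.
const jsExtractor = JSON.stringify({
  type: "js",
  source: extract.toString(),
});
console.log(jsExtractor);
```

The printed string can then be sent as the extractor form value without any manual backslash escaping.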
For example, to use JavaScript to extract values for the title, description, and type attributes from a PDF file, you can pass the following in your cURL call:
--form 'extractor="{\
\"type\": \"js\", \
\"source\": \"function extract(request, response) { \
$ = response.body; \
title = $('\'title\'').text(); \
description = $('\'body\'').text()\
.replace(/(\\\\r\\\\n|\\\\n|\\\\r)/gm, '\'' '\'')\
.replace(/ +/g, '\'' '\'')\
.replace(/\\\\.+/g, '\''.'\'')\
.substring(0, 7000); \
return [{ \
'\''title'\'': title, \
'\''description'\'': description, \
'\''type'\'': \"PDF\" \
}]; \
}\"\
}"'
In that example, you use the following logic to get attribute values:
- title: use a CSS selector that targets the HTML <title> tag.
- description: normalize space and line breaks, remove excessive punctuation, and truncate the text to a maximum of 7,000 characters.
- type: assign the fixed value of PDF to all index documents.
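The description clean-up is easier to follow without the shell and JSON escaping. The sketch below applies the same three replacements and the same truncation to a plain string so that you can test the logic locally. The function name normalizeDescription and the sample input are hypothetical.

```javascript
// Same replacement chain as the description logic in the example above,
// written without shell or JSON escaping.
function normalizeDescription(text, maxLength = 7000) {
  return text
    .replace(/(\r\n|\n|\r)/gm, " ") // line breaks become single spaces
    .replace(/ +/g, " ")            // collapse runs of spaces
    .replace(/\.+/g, ".")           // collapse runs of periods
    .substring(0, maxLength);       // truncate to the maximum length
}

const sample = normalizeDescription("Hello\r\nworld...  ok");
console.log(sample); // Hello world. ok
```

Testing the chain on sample text first makes it easier to confirm the regular expressions behave as intended before you escape them for the cURL call.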