Create an XPath document extractor
Create an XPath document extractor when you want to use an XPath expression to extract attribute values in a Sitecore Search source.
This topic describes how to create an XPath document extractor for an advanced web crawler. Don't use this procedure to extract attributes using XPath for a web crawler source.
To create an XPath document extractor:
-
On the menu bar, click Sources.
-
Select the advanced web crawler.
-
On the Source Settings page, next to Document Extractors, click Edit.
-
To create a document extractor, on the Document Extractors page:
-
In the Name field, enter a meaningful name for this extractor.
For example, enter Sitecore dev portal XPath extractor.
-
In the Extractor Type drop-down menu, click XPath.
-
Optionally, to ensure that this extractor's logic only applies to URLs that match a certain pattern, configure URLs to Match. To do this, in the URLs To Match field, click Add Matcher, select the TYPE of expression you want to use, and enter the VALUE of that expression.
-
-
To define how to extract attributes, in the Taggers section, click Edit next to the first tag. Usually, it is the content tag.
You see a list of some attributes and corresponding XPath extraction expressions. Search provides these sample attributes and expressions to help with configuration.
Note: In a document extractor, you can create multiple taggers, each linked to a unique tag. This way, each tagger:
-
Generates a set of index documents.
-
Can have multiple rules such that each rule defines the extraction logic of one attribute.
One tagger with five rules yields one set of documents, each with five attributes.
Three taggers with one rule each yield three sets of documents, each with one attribute.
-
-
To delete the sample attributes you don't need in this extractor, click Delete.
-
To edit the configuration of a sample attribute, click Edit, and then make changes.
-
To configure how to extract the value of a new attribute, click Add Rule and enter the following details:
-
In the Attribute drop-down menu, select the attribute you want to configure.
For example, click Abstract.
-
In the Value type drop-down menu, choose whether you want the attribute value to be a fixed value or an expression.
For example, click Expressions.
-
In the EXPRESSION field, enter an XPath expression that results in the attribute value.
For example, if you want to get the value of an attribute called abstract from the text of all p elements inside any element whose class attribute contains the word abstract, enter the following XPath expression:
//*[contains(@class, "abstract")]//p
-
-
Optionally, to configure more than one way to get an attribute value, click Add Selector. Then, in the Expression field of the second selector, enter the XPath expression that results in the attribute value.
For example, as a second option, you want to get the abstract from the text of all p elements that are within a div element whose class is topic-content. To do this, enter the following XPath expression:
//div[@class="topic-content"]//p
When there are multiple selectors, Search runs them in order, from first to last, and stops at the first expression that gives a result.
-
In the tag editor, click Save. Then, on the Document Extractors page, click Save.
-
Optionally, to extract attributes for another tag, click Add Tagger, and in the Tag drop-down menu, click a tag. Then, repeat Steps 7 through 9.
-
On the Document Extractors page, click Save.
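To get a feel for how rules and selectors behave before you configure them, you can try the expressions locally. The following is a minimal sketch, not how Search runs extractors internally. It assumes a hypothetical sample page, a hypothetical rules mapping (attribute name to ordered selectors), and the hypothetical helper names first_match and run_tagger. It uses Python's built-in ElementTree, which supports only a subset of XPath and has no contains() function, so the first selector uses an exact class match as a stand-in for //*[contains(@class, "abstract")]//p.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page (must be well-formed XML for ElementTree).
PAGE = """
<html>
  <body>
    <div class="topic-content">
      <p>Search lets you index content from many sources.</p>
      <p>Extractors turn pages into index documents.</p>
    </div>
  </body>
</html>
"""

# Hypothetical rule set for one tagger: one rule per attribute,
# each rule holding its selectors in the order they should be tried.
RULES = {
    "abstract": [
        ".//*[@class='abstract']//p",         # stand-in for the contains() expression
        ".//div[@class='topic-content']//p",  # second selector, used as a fallback
    ],
}


def first_match(root, selectors):
    """Try each selector in order and return the text found by the first one that matches."""
    for expression in selectors:
        texts = []
        for node in root.findall(expression):
            # Collect the node's text and collapse runs of whitespace.
            text = " ".join("".join(node.itertext()).split())
            if text:
                texts.append(text)
        if texts:
            return texts  # stop at the first selector that gives a result
    return []


def run_tagger(root, rules):
    """Build one document with one attribute per rule."""
    return {attribute: first_match(root, selectors) for attribute, selectors in rules.items()}


root = ET.fromstring(PAGE)
document = run_tagger(root, RULES)
print(document)
```

In this sample, no element has the class abstract, so the first selector returns nothing and the value comes from the second selector. Returning a list of paragraph texts is an assumption of this sketch; check a crawled document in Search to see how multiple matches are stored for your attribute.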