Walkthrough: Configuring a web crawler source
The Sitecore Search web crawler is a source that crawls your content by starting from a point and following hyperlinks. For each hyperlink it comes across, the web crawler creates an index document. The web crawler is straightforward, easy to configure, and requires no coding.
The web crawler begins from a starting point, called a trigger. If the trigger is a sitemap or sitemap index, the crawler crawls all URLs listed in the sitemap or sitemap index. If the trigger is a single URL, called a request, the crawler starts at that page and follows hyperlinks from it to new pages.
The web crawler can crawl and index HTML pages, PDFs, and all Microsoft Office content.
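To make the link-following behavior of a request trigger concrete, here is a minimal Python sketch of how hyperlinks can be collected from one page. It is an illustration only, not Sitecore Search's implementation: the LinkCollector class, the extract_links helper, the sample HTML, and the example.com URLs are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collects absolute hyperlink targets from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL and drop #fragments,
        # so two anchors on the same page count as one target.
        absolute, _fragment = urldefrag(urljoin(self.base_url, href))
        if absolute not in self.links:
            self.links.append(absolute)


def extract_links(html, base_url):
    """Return the de-duplicated absolute links found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content, used only for illustration.
SAMPLE_PAGE = """
<html><body>
  <a href="/search/en/index.html">Home</a>
  <a href="guide.html#intro">Guide</a>
  <a href="guide.html#setup">Guide setup</a>
</body></html>
"""

links = extract_links(SAMPLE_PAGE, "https://example.com/docs/start.html")
```

A crawler would repeat this step for each newly discovered page, which is why the scope settings matter: they bound how far the link following goes.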
This walkthrough describes how to:
- Create a web crawler source
- Configure the trigger and scope of the web crawler
- Use HTML meta tags to extract the value for an attribute
- Use an XPath expression to extract the value of an attribute
- Use a fixed value for an attribute value
- Schedule scans
- Publish the source
Create a web crawler source
In this example, select Web Crawler in step 5.
Configure the trigger and scope of the web crawler
You configure the web crawler settings to define the trigger and other high-level settings that determine the scope of the web crawler.
To configure the trigger and scope of the web crawler:
- Go to Sources and then click the source you created.
- On the Source Settings page, click Edit next to Web Crawler Settings.
- To configure a trigger for the web crawler, in the TRIGGER TYPE drop-down menu, click the type of trigger you want to use.
  For example, click Sitemap.
  Note: You can also create a sitemap index trigger or a request trigger.
- In the URL field, enter the URL of the trigger.
  For example, enter the sitemap URL https://doc.sitecore.com/search/sitemap.xml.
  Note: If you select a sitemap or sitemap index trigger, you can add more than one URL.
- Click Save.
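To check which pages a sitemap trigger would cover before you save it, you can list the URLs in the sitemap yourself. The following Python sketch does this for a sitemap that follows the standard sitemaps.org schema. The urls_from_sitemap helper and the sample XML are hypothetical, and the sketch does not claim to reproduce how Search reads sitemaps.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Return every <loc> value in a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    # <loc> appears under <url> in a sitemap and under <sitemap> in a
    # sitemap index, so a descendant search covers both cases.
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap content, used only for illustration.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/a.html</loc></url>
  <url><loc>https://example.com/docs/b.html</loc></url>
</urlset>"""

sitemap_urls = urls_from_sitemap(SAMPLE_SITEMAP)
```

For a sitemap index, the same function returns the URLs of the child sitemaps, each of which would need to be read in turn.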
Use HTML meta tags to extract the value for an attribute
You can use HTML meta tags to extract an attribute value. This example extracts the value of the description attribute from the description meta tag.
To use an HTML meta tag to extract the value for an attribute:
- On the Source Settings page, click Edit next to Attribute Extraction.
- On the Attribute Extraction page, click Add Attribute.
- In the attribute selector, click Add next to the attribute whose value you want to extract, and then click Add at the bottom.
  In this example, click Description.
- In the Type drop-down menu, click Meta Tag.
- In the VALUE field, enter the name of the meta tag or property tag you want to use.
  For example, enter description. When you do this, Search internally creates the following XPath expressions to get the value of the description attribute from the content of the description meta tag:
  First, it creates and runs //meta[@name='description']/@content. If it does not get a value from this expression, it creates and runs //meta[@property='description']/@content. In this example, however, the first expression returns a value.
- Click Save.
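The name-then-property fallback described in the steps above can be sketched in Python as follows. This is an approximation for testing your own pages, not the code Search runs: it uses the standard library HTML parser rather than an XPath engine, and the MetaCollector class, the meta_value helper, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Records the content of <meta> tags, keyed by name and by property."""

    def __init__(self):
        super().__init__()
        self.by_name = {}
        self.by_property = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        content = attr.get("content")
        if content is None:
            return
        # Keep the first occurrence of each key.
        if "name" in attr:
            self.by_name.setdefault(attr["name"], content)
        if "property" in attr:
            self.by_property.setdefault(attr["property"], content)


def meta_value(html, key):
    """Look up a meta tag by name first, then fall back to property."""
    parser = MetaCollector()
    parser.feed(html)
    return parser.by_name.get(key) or parser.by_property.get(key)


# Hypothetical page heads, used only for illustration.
PAGE_WITH_NAME = '<head><meta name="description" content="Crawler walkthrough"></head>'
PAGE_WITH_PROPERTY = '<head><meta property="og:description" content="Open Graph text"/></head>'

from_name = meta_value(PAGE_WITH_NAME, "description")
from_property = meta_value(PAGE_WITH_PROPERTY, "og:description")
```

Running a check like this against a few representative pages is a quick way to confirm that the tag name you enter in the VALUE field exists on those pages.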
Use an XPath expression to extract the value of an attribute
You can use an XPath expression to extract an attribute value.
To use an XPath expression to extract the value of an attribute:
- On the Source Settings page, click Edit next to Attribute Extraction.
- On the Attribute Extraction page, click Add Attribute.
- In the attribute selector, click Add next to the attribute whose value you want to extract, and then click Add at the bottom.
  For example, click Name.
- In the Type drop-down menu, click Xpath.
- In the VALUE field, enter the XPath expression you want to use.
  For example, enter //div[@class='wy-menu-vertical']/p. This tells Search to get the value of the name attribute from the text within the <p> tag of the page's first <div> tag with a class value of wy-menu-vertical.
- Click Save.
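Before entering an XPath expression, it can help to try it against a sample of your markup. The sketch below evaluates the example expression with Python's standard library ElementTree module. It rests on several assumptions: the sample page is hypothetical and must be well-formed XML for ElementTree to parse it, ElementTree supports only a subset of XPath, and it requires a leading "." on the path, so the result may differ from what Search's own XPath engine returns on real HTML. The first_text helper is also hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page markup, used only for illustration.
SAMPLE_PAGE = """<html><body>
  <div class="wy-menu-vertical">
    <p>Sitecore Search</p>
    <ul><li>Other menu entry</li></ul>
  </div>
  <div class="content"><p>Body text</p></div>
</body></html>"""


def first_text(xhtml, path):
    """Return the text of the first element matching path, or None."""
    root = ET.fromstring(xhtml)
    node = root.find(path)
    if node is None:
        return None
    # itertext() also gathers text from nested inline elements.
    return "".join(node.itertext()).strip()


# ElementTree paths are relative to the root element, hence the leading ".".
name = first_text(SAMPLE_PAGE, ".//div[@class='wy-menu-vertical']/p")
```

Note that the predicate @class='wy-menu-vertical' matches only when the class attribute is exactly that string. A div with several classes would need a different expression.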
Use a fixed value for an attribute value
You can assign a fixed value for an attribute if the value is a constant and does not need to be extracted.
For example, a common use case is to assign a fixed value for the type attribute.
To use a fixed value for an attribute:
- On the Source Settings page, click Edit next to Attribute Extraction.
- On the Attribute Extraction page, click Add Attribute.
- In the attribute selector, click Add next to the attribute whose value you want to set, and then click Add at the bottom.
  For example, click Type.
- In the Type drop-down menu, click Fixed.
- In the VALUE field, enter the constant to use for this attribute.
  For example, enter Sitecore Search documentation.
- Click Save.
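Conceptually, a fixed value is a constant applied to every index document from the source, regardless of page content. The following Python sketch illustrates that idea under simplifying assumptions: index documents are modeled as plain dictionaries, and the with_fixed_attributes helper and the sample documents are hypothetical rather than part of Search.

```python
def with_fixed_attributes(document, fixed):
    """Return a copy of the document with the constant attributes applied."""
    return {**document, **fixed}


# Hypothetical extracted documents, used only for illustration.
extracted = [
    {"url": "https://example.com/docs/a.html", "name": "Page A"},
    {"url": "https://example.com/docs/b.html", "name": "Page B"},
]

# The constant from the example above, assigned to the type attribute.
FIXED = {"type": "Sitecore Search documentation"}

indexed = [with_fixed_attributes(doc, FIXED) for doc in extracted]
```

Every document ends up with the same type value, which is what makes fixed values useful for filtering or grouping results by source.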
Schedule scans
Publish the source
You must publish the source to trigger the first scan and index.