Walkthrough: Configuring a crawler source to crawl localized content
In this walkthrough, you learn how to configure a source to crawl and index localized content.
The steps you take that are specific to localized content are configuring available locales, configuring locale extractors, and making sure that localized versions of the same content share the same ID.
This walkthrough describes how to:
-
Create a source
-
Configure crawler settings
-
Configure a sitemap trigger
-
Configure available locales
-
Configure a JavaScript locale extractor
-
Configure a JavaScript document extractor for localized content
-
Schedule scan frequency
-
Publish the source
Create a source
Before you configure a source, you must create one.
Configure the scope of the crawler
Crawler settings are high-level configurations that define the scope of the web crawler.
For example, configure the following settings to define the scope of an advanced web crawler:
-
Go to Sources and then click the source you created.
-
On the Source Settings page, click Edit next to Advanced Web Crawler Settings.
-
Click Save.
Configure a sitemap trigger
The trigger gives the crawler a starting point to look for content to index.
In this example, you use a sitemap as a trigger.
Configure available locales
You configure available locales to define the subset of domain languages you want to use for this source.
For example, you want English (US), French (France), and Japanese (Japan) as locales. The crawler adds English (US) as a default locale, but you must add French (France) and Japanese (Japan).
To configure locales:
-
On the Source Settings page, click Edit next to Available Locales.
-
On the Available Locales page, click inside the LOCALES field to see a list of locales available for this domain and then click the locales you want for this source.
For example, click fr_fr and ja_jp.
-
Click Save.
Configure a JavaScript locale extractor
You configure a locale extractor to define how the crawler must extract the locale from each URL it crawls.
In this example, you use a JavaScript function to define this rule.
You can also configure a URL extractor or a header extractor.
To configure a JavaScript locale extractor:
-
On the Source Settings page, click Edit next to Locale Extractors.
-
To create a locale extractor, on the Locale Extractors page:
-
In the Name field, enter a meaningful name for the extractor.
For example, enter Sitecore.com JS locale extractor.
-
In the Extractor Type drop-down menu, click JS.
-
-
In the JS Source field, enter a JS function in Cheerio syntax to extract locales.
Here's a sample JavaScript function to extract locales:
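function extract(request, response) {
    // URL of the page that linked to the page being crawled
    const parentUrl = request.context.parent.request.url;
    const locales = ['fr-fr', 'ja-jp'];
    for (const locale of locales) {
        if (parentUrl.indexOf('/' + locale + '/') >= 0) {
            // Match the format used in Available Locales, for example fr_fr
            return locale.toLowerCase().replace('-', '_');
        }
    }
    // No locale segment found: fall back to the default locale
    return 'en_us';
}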
The function uses this logic:
-
If the page URL has a locale, get the locale from the URL of the page and format it in lowercase and with an underscore. This formatting is important because it must match the locale formatting in Available Locales, that is, fr_fr and ja_jp.
-
If the page URL has no locale, assign the index document the default locale of en_us.
-
-
Click Save.
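Before you paste a locale extractor into the JS Source field, it can help to run it locally against a few URLs. The following is a minimal sketch of such a check, not part of the product. The mockRequest helper, the example.com URLs, and the availableLocales list are assumptions made for illustration; the request shape only mirrors the one property the sample function reads.

```javascript
// Locale extractor from this walkthrough, reproduced so it can run locally.
function extract(request, response) {
  const parentUrl = request.context.parent.request.url;
  const locales = ['fr-fr', 'ja-jp'];
  for (const locale of locales) {
    if (parentUrl.indexOf('/' + locale + '/') >= 0) {
      return locale.toLowerCase().replace('-', '_');
    }
  }
  return 'en_us';
}

// Hypothetical helper: builds the minimal request shape the function reads.
function mockRequest(parentUrl) {
  return { context: { parent: { request: { url: parentUrl } } } };
}

// Assumed to match the locales selected in Available Locales.
const availableLocales = ['en_us', 'fr_fr', 'ja_jp'];

// Hypothetical URLs used only for this check.
const sampleUrls = [
  'https://www.example.com/fr-fr/products',
  'https://www.example.com/ja-jp/products',
  'https://www.example.com/products',
];

const results = sampleUrls.map((url) => extract(mockRequest(url), {}));

// Every extracted locale should be one of the configured locales.
const allConfigured = results.every((locale) => availableLocales.includes(locale));

sampleUrls.forEach((url, i) => console.log(url + ' -> ' + results[i]));
console.log('All locales configured: ' + allConfigured);
```

If a URL produces a locale that is not in the list, the extractor output and the Available Locales setting are out of sync.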
Configure a JavaScript document extractor for localized content
Configure a document extractor to specify which attributes you want the advanced web crawler to extract. When you have localized content, configure the id in addition to other attributes that you might require. You do this to ensure that localized versions of the same content share the same ID.
In this example, you use JavaScript to extract the id, type, title, subtitle, and product attributes.
To use JavaScript to extract attributes when you have localized content:
-
On the Source Settings page, click Edit next to Document Extractors.
-
To create a document extractor, on the Document Extractors page:
-
In the Name field, enter a meaningful name for the extractor.
For example, enter Sitecore with languages.
-
In the Extractor Type drop-down menu, click JS.
-
-
In the Taggers section, click Add Tagger.
In the tag editor, you see a sample JavaScript function that returns the description, name, type, image_url and url attributes. Search provides this sample to help with configuration.
-
To enable localized content, turn on the Localized switch.
-
In the tag editor, paste a JavaScript function that returns attribute values. Ensure that the value of the id attribute is constant across localized versions of the same content.
The function must use Cheerio syntax and must return an array of objects.
For example, paste the following code:
function extract(request, response) {
    const $ = response.body;
    let url = request.url;
    // Remove the locale segment so that all localized versions share one URL
    const locales = ['/fr-fr/', '/ja-jp/'];
    for (const locale of locales) {
        url = url.replace(locale, '/');
    }
    // Build the shared ID from the locale-free URL
    const id = url.replaceAll('/', '_').replaceAll(':', '_').replaceAll('.', '_');
    return [{
        'id': id,
        'type': $('meta[name="type"]').attr('content') || 'Others',
        'title': $('h1').text(),
        'subtitle': $('meta[name="description"]').attr('content') || $('section[data-component-name="Hero Banner"] div[class*="side-content"]>div>p, header div.lead p').text(),
        'product': $('meta[name="product"]').attr('content'),
    }];
}
The code uses the following logic to get attribute values:
-
id - get this from the URL after ignoring the locale. To generate an ID from a URL, first replace the locale segment with a slash (/). Then, replace all slashes (/), dots (.), and colons (:) in the URL with an underscore (_).
-
type - use the content from the first <meta name="type"> tag. If this tag does not exist, use Others.
-
title - use the text of the <h1> HTML tag.
-
subtitle - use the content from the first <meta name="description"> tag. If this tag does not exist, use the text from the <div class="side-content"> of the section that has the data-component-name="Hero Banner" attribute, or from the lead paragraph in the page header.
-
product - use the content from the first <meta name="product"> tag.
-
-
In the Edit Taggers window, click Save. Then, on the Document Extractors page, click Save.
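To see why localized versions of the same content end up with the same ID, the ID logic can be isolated and run on its own. The following is a minimal sketch under stated assumptions: toSharedId is a hypothetical helper name, the example.com URLs are made up, and a single regular expression stands in for the chained replaceAll calls in the sample (it replaces the same three characters).

```javascript
// Hypothetical helper that mirrors the ID logic of the sample document extractor.
function toSharedId(url, localeSegments) {
  let normalized = url;
  // Replace each locale segment, for example /fr-fr/, with a single slash.
  for (const segment of localeSegments) {
    normalized = normalized.replace(segment, '/');
  }
  // Replace every slash, colon, and dot with an underscore.
  return normalized.replace(/[\/:.]/g, '_');
}

const localeSegments = ['/fr-fr/', '/ja-jp/'];

// Hypothetical URLs for the English, French, and Japanese versions of one page.
const localizedUrls = [
  'https://www.example.com/products/search',
  'https://www.example.com/fr-fr/products/search',
  'https://www.example.com/ja-jp/products/search',
];

const ids = localizedUrls.map((url) => toSharedId(url, localeSegments));

// All three versions should produce one distinct ID.
const distinctIds = new Set(ids);
console.log(ids[0]);
console.log('Distinct IDs: ' + distinctIds.size);
```

If the set contains more than one ID, a locale segment is probably missing from the list or appears in the URLs in a different form, such as a different case.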
Schedule scan frequency
Publish the source
You must publish the source to start the first scan and index.