Configure a localized advanced web crawler

In this walkthrough, you learn how to configure a source to crawl and index localized content.

The steps you take that are specific to localized content are configuring available locales, configuring locale extractors, and making sure that localized versions of the same content share the same ID.

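This walkthrough describes how to:

  • Create a source

  • Configure the scope of the crawler

  • Configure a sitemap trigger

  • Configure available locales

  • Configure a JavaScript locale extractor

  • Configure a JavaScript document extractor for localized content

  • Create a crawler scheduler

  • Publish the source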

Note

If your original content requires authentication before the crawler can access it, configure crawler authentication settings. For example, your original content might need a GUI-based user name and password, or an access token or key in the request header. You can do this at any point after source creation.

Create a source

Before you configure a source, you must create one.

Configure the scope of the crawler

You configure crawler settings to define the scope of the web crawler.

To configure the scope of the advanced web crawler:

  1. On the Source Settings page, click Edit next to Advanced Web Crawler Settings.

  2. To stop the crawler from accepting navigation cookies, turn off ENABLE NAVIGATION COOKIES DURING CRAWLING.

    When you index content in multiple locales, it's important to disable cookie acceptance. This is because cookies can limit the crawler to the first locale it encounters, and the crawler might not index content across all locales.

  3. Click Save.

Configure a sitemap trigger

The trigger gives the crawler a starting point to look for content to index.

Configure available locales

You configure available locales to define the subset of domain languages you want to use for this source.

For example, you want English (US), French (France), and Japanese (Japan) as locales. The crawler adds English (US) as a default locale, but you must add French (France) and Japanese (Japan).

To configure locales:

  1. On the Source Settings page, click Edit next to Available Locales.

  2. On the Available Locales page, click inside the LOCALES field to see a list of locales available for this domain and then click the locales you want for this source.

    For example, click fr_fr and ja_jp.

  3. Click Save.

Configure a JavaScript locale extractor

You configure a locale extractor to define how the crawler must extract the locale from each URL it crawls.

In this example, you use a JavaScript function to define this rule.

Note

You can also configure a URL extractor (using a REGEX) or a header extractor.
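
To illustrate the kind of pattern a regular expression approach relies on, here is a minimal JavaScript sketch. It is not the URL extractor configuration itself: the fields and regular expression dialect of the URL extractor are not described in this walkthrough, and the names localePattern and localeFromUrl, the example.com URLs, and the en_us fallback are assumptions made for illustration.

```javascript
// Hypothetical pattern: a path segment made of two letters, a hyphen,
// and two letters, for example /fr-fr/ or /ja-jp/.
const localePattern = /\/([a-z]{2})-([a-z]{2})\//i;

// Hypothetical helper: returns the locale in the fr_fr format,
// or the fallback when the URL has no locale segment.
function localeFromUrl(url, fallback = 'en_us') {
    const match = localePattern.exec(url);
    return match ? `${match[1]}_${match[2]}`.toLowerCase() : fallback;
}

console.log(localeFromUrl('https://www.example.com/fr-FR/products')); // prints fr_fr
console.log(localeFromUrl('https://www.example.com/products'));       // prints en_us
```

A pattern this general also matches any other two-letter pair in the path, so you might prefer to list the exact locales you configured in Available Locales.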

To configure a JavaScript locale extractor:

  1. On the Source Settings page, click Edit next to Locale Extractors.

  2. To create a locale extractor, on the Locale Extractors page:

    • In the Name field, enter a meaningful name for the extractor.

      For example, enter Sitecore.com JS locale extractor.

    • In the Extractor Type drop-down menu, click JS.

  3. In the JS Source field, enter a JS function in Cheerio syntax to extract locales.

    Here's a sample JavaScript function to extract locales:

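    function extract(request, response) {
        // Locale segments as they appear in the site's URLs.
        const locales = ['fr-fr', 'ja-jp'];
        for (const locale of locales) {
            if (request.url.indexOf('/' + locale + '/') >= 0) {
                // Convert to the format used in Available Locales, for example fr_fr.
                return locale.toLowerCase().replace('-', '_');
            }
        }
        // No locale segment in the URL: use the default locale.
        return 'en_us';
    }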

    The function uses this logic:

    • If the page URL has a locale, get the locale from the URL of the page and format it in lower case and with an underscore. This formatting is important because it must match the locale formatting in Available Locales, that is, fr_fr and ja_jp.

    • If the page URL has no locale, assign the index document the default locale of en_us.

  4. Click Save.
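
Before you paste a locale extractor into the JS Source field, you might want to check it locally. The following sketch runs the same logic against a few stub requests. The example.com URLs and the stub objects are assumptions for illustration; the crawler passes its own request and response objects, and only request.url is used here.

```javascript
// Same logic as the sample locale extractor.
function extract(request, response) {
    const locales = ['fr-fr', 'ja-jp'];
    for (const locale of locales) {
        if (request.url.indexOf('/' + locale + '/') >= 0) {
            return locale.toLowerCase().replace('-', '_');
        }
    }
    return 'en_us';
}

// Hypothetical sample URLs: two localized pages and one default page.
const samples = [
    'https://www.example.com/fr-fr/products',
    'https://www.example.com/ja-jp/products',
    'https://www.example.com/products',
];

// Stub request objects that carry only the url property.
const results = samples.map((url) => extract({ url }, {}));
console.log(results); // prints [ 'fr_fr', 'ja_jp', 'en_us' ]
```

The match is case sensitive, so a URL that contains /FR-FR/ falls through to en_us. If your site uses mixed-case locale segments, consider lowercasing the URL before the comparison.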

Configure a JavaScript document extractor for localized content

Configure a document extractor to specify which attributes you want the advanced web crawler to extract. When you have localized content, configure the id in addition to other attributes that you might require. You do this to ensure that localized versions of the same content share the same ID.

In this example, you use JavaScript to extract the id, type, title, subtitle, and product attributes.

To configure a JavaScript document extractor for localized content:

  1. On the Source Settings page, click Edit next to Document Extractors.

  2. To create a document extractor, on the Document Extractors page:

    • In the Name field, enter a meaningful name for the extractor.

      For example, enter Sitecore with languages.

    • In the Extractor Type drop-down menu, click JS.

  3. In the Taggers section, click Add Tagger.

    In the tag editor, you see a sample JavaScript function that returns the description, name, type, image_url and url attributes. Search provides this sample to help with configuration.

  4. To enable localized content, turn on the Localized switch.

  5. In the tag editor, paste a JavaScript function that returns attribute values. Ensure that the value of the id attribute is constant across localized versions of the same content.

    The function must use Cheerio syntax and must return an array of objects.

    For example, paste the following code:

    function extract(request, response) {
        const $ = response.body;
        // Remove the locale segment so every localized version yields the same URL.
        let url = request.url;
        const locales = ['/fr-fr/', '/ja-jp/'];
        for (const locale of locales) url = url.replace(locale, '/');
        // Build the ID by replacing slashes, colons, and dots with underscores.
        const id = url.replaceAll('/', '_').replaceAll(':', '_').replaceAll('.', '_');
        return [{
            'id': id,
            'type': $('meta[name="type"]').attr('content') || 'Others',
            'title': $('h1').text(),
            'subtitle': $('meta[name="description"]').attr('content') || $('section[data-component-name="Hero Banner"] div[class*="side-content"]>div>p, header div.lead p').text(),
            'product': $('meta[name="product"]').attr('content'),
        }];
    }

    The code uses the following logic to get attribute values:

    • id - get this from the URL after ignoring the locale. To generate an ID from a URL, first replace the locale with a slash (/). Then, replace all slashes (/), dots (.), and colons in the URL with an underscore (_).

    • type - use the content from the first <meta name="type"> tag. If this tag does not exist, use Others.

    • title - use the text of the first <h1> HTML tag.

    • subtitle - use the content from the first <meta name="description"> tag. If this tag does not exist, use the text of the <p> elements in the side-content <div> of the section whose data-component-name attribute is Hero Banner, or in the lead <div> of the <header>.

    • product - use the content from the first <meta name="product"> tag.

  6. In the Edit Taggers window, click Save. Then, on the Document Extractors page, click Save.
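
To see why localized versions of the same page end up with the same ID, you can run the ID logic on its own. The following is a minimal sketch: the function name buildSharedId and the example.com URLs are assumptions for illustration, and a single regular expression stands in for the three replaceAll calls, which has the same effect.

```javascript
// Hypothetical helper that mirrors the ID logic of the document extractor.
function buildSharedId(pageUrl) {
    const localeSegments = ['/fr-fr/', '/ja-jp/'];
    let url = pageUrl;
    // Replace the locale segment with a single slash.
    for (const segment of localeSegments) {
        url = url.replace(segment, '/');
    }
    // Replace every slash, colon, and dot with an underscore.
    return url.replace(/[\/:.]/g, '_');
}

// Hypothetical URLs for three localized versions of the same page.
const ids = [
    'https://www.example.com/products/search',
    'https://www.example.com/fr-fr/products/search',
    'https://www.example.com/ja-jp/products/search',
].map(buildSharedId);

// All three IDs are 'https___www_example_com_products_search'.
console.log(ids);
```

If you add a locale in Available Locales, remember to add its URL segment to the list in the document extractor as well. Otherwise, pages in that locale get their own IDs.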

Create a crawler scheduler

Publish the source

You must publish the source to start the first scan and index.
