Configure a localized advanced web crawler

In this walkthrough, you learn how to configure a source to crawl and index localized content. This includes configuring available locales, configuring locale extractors, and making sure that localized versions of the same content share the same ID.

Important

You can index localized versions of crawlable items only with the advanced web crawler.

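This walkthrough describes how to:

Create the crawler
Configure the crawler's scope
Configure a sitemap trigger
Configure available locales
Configure a JavaScript locale extractor
Configure a JavaScript document extractor for localized content
Create a crawler scheduler
Publish updates to the source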

Note

If your original content requires authentication before the crawler can access it, configure crawler authentication settings. For example, your original content might need a GUI-based user name and password, or an access token or key in the request header. You can do this at any point after source creation.

Create the crawler

To index localized items, you need an advanced web crawler. If you already have a standard web crawler, you can convert it to an advanced web crawler.

To create a new advanced web crawler:

On the menu bar, click Sources > Add Source.
In the SOURCE NAME field, enter a name for the source. For example, enter Doc site web crawler.
In the DESCRIPTION field, enter a few lines to describe the source you want to configure. For example, enter Web crawler to crawl all pages of the doc site.
In the CONNECTOR field, click Advanced web crawler as the type of source you want to create.
Click Save. If there are no errors, Search creates a new source.

After you create a source, configure the source to access and index content.

Configure the crawler's scope

Crawler settings define the high-level scope of the web crawler, such as which domains it stays within, how deep it follows links, and how many URLs it crawls.

To configure the scope of the advanced web crawler:

On the Source Settings page, click Edit next to Advanced Web Crawler Settings.
Optionally, in the ALLOWED DOMAINS field, enter the domains for the crawler to stay within. This prevents the crawler from following links to external sites.

For example, to ensure that the web crawler only crawls Sitecore documentation domains, not external sites that might be linked, enter www.doc.sitecore.com.
To configure the depth and number of URLs for the crawler to crawl:
- In the MAX DEPTH field, enter the maximum number of levels that you want the crawler to follow for a URL.
  
  For example, to limit the crawler to child pages of the main page, enter 1.
- In the MAX URLS field, enter the maximum number of URLs for the crawler to crawl, in total. Enter a large number to ensure that the crawler leaves no URL out.
  
  For example, enter 5000.
To exclude certain URL patterns from the crawler's scope, click Add Exclusion Pattern. Then, in the TYPE drop-down menu, select Glob expression or Regular Expression. In the VALUE field, enter the expression to match URLs to exclude.

For example, to prevent the crawler from crawling your search page, enter the following Glob expression:
**/search/**
To configure the number of workers to crawl in parallel and an optional delay between requests:
- Define the number of threads, or workers, that concurrently crawl and index content by clicking a value in the PARALLELISM (WORKERS) drop-down menu.
  
  For example, click 2 to have only two workers crawl in parallel. This uses less memory than the default 5 workers.
- Optionally, if you have configured only one worker, you can define a time for the crawler to wait before it accesses the next URL to index. To do this, in the DELAY (MS) field, enter time in milliseconds.
  
  For example, enter 3.
In the TIMEOUT field, enter the time, in milliseconds, that the crawler waits to get a response.

For example, enter 5000. This ensures that the crawler waits for 5000 milliseconds, or 5 seconds, to get a response from every URL it crawls.
Optionally, to add headers, click Add Header. Then, in the Key field, enter the name of the header that your content expects. In the Value field, enter the value of that header. For example, you can send a specific user agent so that your site can identify the Search crawler and refuse requests from other crawlers.

For example, enter user-agent as the Key and sitecorebot as the Value.
To stop the crawler from accepting navigation cookies, turn off ENABLE NAVIGATION COOKIES DURING CRAWLING.

When you index content in multiple locales, it's important to disable cookie acceptance. This is because cookies can limit the crawler to the first locale it encounters, and the crawler might not index content across all locales.
Optionally, if you want the crawler to wait for JavaScript on the page to render and crawl the rendered content in addition to the page source, turn on Render JavaScript in the Additional Settings section.
Click Save.
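
The following sketch shows one way to reason about how the scope settings above combine. It is not how Search implements scoping: the scope object, the globToRegExp helper, and the isInScope function are hypothetical names, and the sketch assumes that depth 0 is the starting URL and that a glob is matched against the full URL.

```javascript
// Hypothetical settings that mirror the fields described above.
const scope = {
    allowedDomains: ['www.doc.sitecore.com'],
    exclusionGlobs: ['**/search/**'],
    maxDepth: 1,
};

// Convert a simple glob to a RegExp.
// Assumption: "**" matches any characters, "*" matches within one path segment.
function globToRegExp(glob) {
    const escaped = glob.replace(/[.+^${}()|[\]\\?]/g, '\\$&');
    const pattern = escaped
        .replace(/\*\*/g, '\u0000')   // protect "**" before handling "*"
        .replace(/\*/g, '[^/]*')
        .replace(/\u0000/g, '.*');
    return new RegExp(`^${pattern}$`);
}

// Return true if a URL found at the given link depth would be crawled.
function isInScope(url, depth, settings) {
    const { hostname } = new URL(url);
    if (!settings.allowedDomains.includes(hostname)) return false;
    if (depth > settings.maxDepth) return false;
    return !settings.exclusionGlobs.some((glob) => globToRegExp(glob).test(url));
}

console.log(isInScope('https://www.doc.sitecore.com/guide/intro', 1, scope));    // true
console.log(isInScope('https://www.doc.sitecore.com/search/results', 1, scope)); // false
```

Running a few of your own URLs through a check like this before you save can help you confirm that an exclusion pattern is not broader than you intended.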

Configure a sitemap trigger

The trigger gives the crawler a starting point to look for content to index.

To configure a sitemap or sitemap index trigger:

On the Source Settings page, click Edit next to Triggers.
Click Add Trigger.
In the Trigger Type drop-down menu:
- If you have a sitemap, click Sitemap.
- If you have a sitemap index, click Sitemap Index.
In the TIMEOUT field, enter the time, in milliseconds, that the crawler waits to get a response. Enter a large number here because sitemaps can take a long time to load, and you do not want the crawler to time out before the sitemap finishes loading.

For example, enter 10000.
In the URL field, click Add Item and then enter the URL of the sitemap or sitemap index.

For example, enter https://www.sitecore.com/sitemap.xml.
Click Save.
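
To check which starting points a sitemap gives the crawler, you can list its URLs locally before you configure the trigger. The following is a minimal sketch that assumes a standard sitemap with <loc> entries. The sample XML, its URLs, and the listSitemapUrls helper are illustrative and are not part of Search.

```javascript
// Illustrative sitemap that contains localized versions of the same page.
const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.sitecore.com/products</loc></url>
  <url><loc>https://www.sitecore.com/fr-fr/products</loc></url>
  <url><loc>https://www.sitecore.com/ja-jp/products</loc></url>
</urlset>`;

// Return the text of every <loc> element in the sitemap XML.
function listSitemapUrls(xml) {
    return [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map((match) => match[1]);
}

console.log(listSitemapUrls(sampleSitemap));
```

If the localized URLs you expect are missing from the list, the crawler has no starting point for those locales unless other pages link to them.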

Configure available locales

You configure available locales to define the subset of domain languages you want to use for this source.

For example, you want English (US), French (France), and Japanese (Japan) as locales. The crawler adds English (US) as a default locale, but you must add French (France) and Japanese (Japan).

Important

To ensure stable ingestion, do not include more than 40 locales per crawler source.

To configure locales:

On the Source Settings page, click Edit next to Available Locales.
On the Available Locales page, click inside the LOCALES field to see a list of locales available for this domain and then click the locales you want for this source.

For example, click fr_fr and ja_jp.
Click Save.

Configure a JavaScript locale extractor

You configure a locale extractor to define how the crawler must extract the locale from each URL it crawls.

Important

In the extractor logic, the locale format must match the locale formatting in Available Locales. That is, ${language}_${country} or ${language}-${country}.

In this example, you use a JavaScript function to define this rule.

Note

You can also configure a URL extractor (using a regular expression) or a header extractor.

To configure a JavaScript locale extractor:

On the Source Settings page, click Edit next to Locale Extractors.
To create a locale extractor, on the Locale Extractors page:
- In the Name field, enter a meaningful name for the extractor.
  
  For example, enter Sitecore.com JS locale extractor.
- In the Extractor Type drop-down menu, click JS.

In the JS Source field, enter a JS function in Cheerio syntax to extract locales.

Here's a sample JavaScript function to extract locales:

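function extract(request, response) {
    // Locale segments as they appear in page URLs.
    const locales = ['fr-fr', 'ja-jp'];
    for (const locale of locales) {
        if (request.url.indexOf('/' + locale + '/') >= 0) {
            // Convert to the Available Locales format, for example fr_fr.
            return locale.toLowerCase().replace('-', '_');
        }
    }
    // No locale segment in the URL: fall back to the default locale.
    return 'en_us';
}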

The function uses this logic:

If the page URL has a locale, get the locale from the URL of the page and format it in lower case and with an underscore. This formatting is important because it must match the locale formatting in Available Locales, that is, fr_fr and ja_jp.
If the page URL has no locale, assign the index document the default locale of en_us.

Click Save.
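
Before you rely on an extractor, it can help to test the locale logic locally with a few sample URLs. The following sketch is a variant of the rule above that detects the locale segment with a regular expression and accepts it only if it is one of the configured locales. The AVAILABLE_LOCALES set and the extractLocale helper are hypothetical names, and the sketch assumes a runtime with the standard URL class, such as Node.js. To use similar logic in Search, you would wrap it in the extract(request, response) function shown earlier.

```javascript
// Assumed to match the locales selected in Available Locales.
const AVAILABLE_LOCALES = new Set(['en_us', 'fr_fr', 'ja_jp']);

// Return the locale for a page URL in language_country format.
function extractLocale(url) {
    // Look for a path segment such as /fr-fr/ or /ja_JP/.
    const match = new URL(url).pathname.match(/\/([a-z]{2})[-_]([a-z]{2})(?=\/|$)/i);
    if (match) {
        const locale = `${match[1]}_${match[2]}`.toLowerCase();
        if (AVAILABLE_LOCALES.has(locale)) return locale;
    }
    // No configured locale in the URL: use the default locale.
    return 'en_us';
}

console.log(extractLocale('https://www.sitecore.com/fr-fr/products')); // fr_fr
console.log(extractLocale('https://www.sitecore.com/products'));       // en_us
```

Checking the result against the set of configured locales keeps the extractor from returning a value that Available Locales does not contain.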

Configure a JavaScript document extractor for localized content

Configure a document extractor to specify which attributes you want the advanced web crawler to extract. When you have localized content, configure the id in addition to other attributes that you might require. You do this to ensure that localized versions of the same content share the same ID.

In this example, you use JavaScript to extract the id, type, title, subtitle, and product attributes.

To configure a JavaScript document extractor for localized content:

On the Source Settings page, click Edit next to Document Extractors.
To create a document extractor, on the Document Extractors page:
- In the Name field, enter a meaningful name for the extractor.
  
  For example, enter Sitecore with languages.
- In the Extractor Type drop-down menu, click JS.
In the Taggers section, click Add Tagger.

In the tag editor, you see a sample JavaScript function that returns the description, name, type, image_url and url attributes. Search provides this sample to help with configuration.
To enable localized content, turn on the Localized switch.

In the tag editor, paste a JavaScript function that returns attribute values. Ensure that the value of the id attribute is constant across localized versions of the same content.

The function must use Cheerio syntax and must return an array of objects.

For example, paste the following code:

function extract(request, response) {
    const $ = response.body;
    // Remove the locale segment so that all localized versions share one URL.
    let url = request.url;
    const locales = ['/fr-fr/', '/ja-jp/'];
    for (const locale of locales) url = url.replace(locale, '/');
    // Build the ID by replacing slashes, colons, and dots with underscores.
    const id = url.replaceAll('/', '_').replaceAll(':', '_').replaceAll('.', '_');
    return [{
        'id': id,
        'type': $('meta[name="type"]').attr('content') || 'Others',
        'title': $('h1').text(),
        'subtitle': $('meta[name="description"]').attr('content') || $('section[data-component-name="Hero Banner"] div[class*="side-content"]>div>p, header div.lead p').text(),
        'product': $('meta[name="product"]').attr('content')
    }];
}

The code uses the following logic to get attribute values:

id - get this from the URL after ignoring the locale. To generate an ID from a URL, first replace the locale segment with a slash (/). Then, replace all slashes (/), dots (.), and colons (:) in the URL with an underscore (_).
type - use the content from the first <meta name="type"> tag. If this tag does not exist, use Others.
title - use the text of the <h1> HTML tag.
subtitle - use the content from the first <meta name="description"> tag. If this tag does not exist, use the paragraph text from the side-content <div> of the section with the data-component-name="Hero Banner" attribute, or from the lead paragraph in the page header.
product - use the content from the first <meta name="product"> tag.

In the Edit Taggers window, click Save. Then, on the Document Extractors page, click Save.
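
To confirm that localized versions of a page produce the same ID, you can run the ID logic on its own with a few sample URLs. The following sketch isolates that part of the extractor. The buildId helper and the sample URLs are illustrative, and the sketch assumes a runtime that supports String.prototype.replaceAll.

```javascript
// Locale segments to strip, matching the document extractor above.
const LOCALE_SEGMENTS = ['/fr-fr/', '/ja-jp/'];

// Build a locale-independent ID from a page URL.
function buildId(url) {
    let normalized = url;
    for (const segment of LOCALE_SEGMENTS) {
        normalized = normalized.replace(segment, '/');
    }
    return normalized.replaceAll('/', '_').replaceAll(':', '_').replaceAll('.', '_');
}

const ids = [
    'https://www.sitecore.com/products/search',
    'https://www.sitecore.com/fr-fr/products/search',
    'https://www.sitecore.com/ja-jp/products/search',
].map(buildId);

console.log(ids);
// All three entries are https___www_sitecore_com_products_search
```

If the IDs differ, check that every locale segment used on your site appears in the list, with the same casing and separator as in your URLs.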

Create a crawler scheduler

To create a crawler schedule:

On the menu bar, click Sources.
Click the source that you want to schedule crawls for.
On the Source Settings page, click Edit next to Crawler Scheduler.
To configure when you want Search to start scheduled crawls, click a value in the STARTS drop-down menu. If you want the schedule to start as soon as possible, select Anytime. If you want the crawl to start on a particular date, click Specific Date, and click a date in the date picker.
To have the crawl run on a regular schedule, in the REPEAT drop-down menu, click Yes.

Tip

To schedule a crawl to happen just once on a future date, in the REPEAT drop-down menu, click DOES NOT REPEAT. With this configuration, a crawl will happen on the date you selected in Step 4 and will not repeat.
To define the frequency of crawls, click values next to the Repeats every field. You can click any value from 1 to 99 as the interval and either days, weeks, or (for production domains only) hours as the unit of time. For example, if you want a crawl to run every four weeks, click 4 and weeks.
To define the time at which you want crawls to start, click a value in the RUN TIME drop-down menu. For example, if you want the crawl to start at midnight, click 12:00 AM. The time displayed is automatically adjusted to your time zone.
To configure when you want the crawler schedule to end, click a value in the END DATE drop-down menu. If you want the schedule to continue indefinitely, select Never. If you want the crawl to end on a particular date, click Specific Date, and click a date in the date picker.
Click Save.
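
To preview when a repeating schedule would run, you can compute the run times yourself. The following sketch is an approximation and not how Search calculates schedules: the nextRuns helper and the sample date are hypothetical, and the sketch uses fixed-length intervals in UTC, so it ignores daylight saving time changes.

```javascript
// Length of each supported unit in milliseconds.
const UNIT_MS = new Map([
    ['hours', 60 * 60 * 1000],
    ['days', 24 * 60 * 60 * 1000],
    ['weeks', 7 * 24 * 60 * 60 * 1000],
]);

// Return the first `count` run times, starting with `firstRun` itself.
function nextRuns(firstRun, interval, unit, count) {
    if (!Number.isInteger(interval) || interval < 1 || interval > 99) {
        throw new RangeError('interval must be an integer from 1 to 99');
    }
    if (!UNIT_MS.has(unit)) {
        throw new RangeError('unit must be hours, days, or weeks');
    }
    const step = interval * UNIT_MS.get(unit);
    return Array.from({ length: count }, (_, i) => new Date(firstRun.getTime() + i * step));
}

// A crawl every four weeks, starting at midnight UTC on a sample date.
const runs = nextRuns(new Date('2025-01-06T00:00:00Z'), 4, 'weeks', 3);
console.log(runs.map((run) => run.toISOString()));
// [ '2025-01-06T00:00:00.000Z', '2025-02-03T00:00:00.000Z', '2025-03-03T00:00:00.000Z' ]
```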

Publish updates to the source

You must publish the source to start the first scan and index.

To publish a source:

On the menu bar, click Sources.
Click the source you want to publish and click Publish.
In the Publish Source dialog, if you want Search to start a recrawl for this source, select the Trigger source recrawl after publishing check box.
Click Publish.
