Configure an API crawler

The Sitecore Search API crawler is designed to handle JSON content. It supports complex use cases such as authenticating to access your source content, creating index documents in multiple languages, and using JavaScript to extract attribute values.

The API crawler works by accessing URLs or API endpoints and indexing the content in each URL or endpoint.

Important

Refer to detailed specifications of crawler types when choosing an indexing method.

Follow these indexing best practices to successfully crawl complete websites or crawl frequent new updates.

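This walkthrough describes how to:

- Create an API crawler source
- Configure crawler settings
- Configure a trigger
- Configure a request extractor
- Configure a document extractor
- Schedule scans
- Publish updates to the source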

Note

If your original content requires authentication before the crawler can access it, configure crawler authentication settings. For example, your original content might need a GUI-based user name and password, or an access token or key in the request header. You can do this at any point after source creation.

Create an API crawler source

To create a source:

On the menu bar, click Sources.
Click Add Source.
In the SOURCE NAME field, enter a name for the source.
In the DESCRIPTION field, enter a few lines to describe the source you want to configure.
In the CONNECTOR drop-down list, click API Crawler.
Click Save. If there are no errors, Search creates a new source.

Configure crawler settings

Crawler settings are high-level configurations that define the scope of the API crawler.

To configure API crawler settings:

Click Sources and then click the source you created.
On the Source Settings page, next to API Crawler Settings, click Edit.
To configure the depth and number of URLs for the crawler to crawl:
- In the MAX DEPTH field, enter the maximum number of levels that you want the crawler to follow for a URL.
  
  For example, enter 5.
- In the MAX URLS field, enter the maximum number of URLs for the crawler to crawl, in total. Enter a large number to ensure that the crawler leaves no URL out.
  
  For example, enter 5000.
To configure the number of workers to crawl in parallel and an optional delay between requests:
- Define the number of threads, or workers, that concurrently crawl and index content by clicking a value in the PARALLELISM (WORKERS) drop-down menu.
  
  For example, click 2 to have only two workers crawl in parallel. This uses less memory than the default of 5 workers.
- Optionally, if you have configured only one worker, you can define a time for the crawler to wait before it accesses the next URL to index. To do this, in the DELAY (MS) field, enter time in milliseconds.
  
  For example, enter 3.
In the TIMEOUT field, enter the time, in milliseconds, that the crawler waits to get a response. The default is 10,000 ms.

For example, enter 5000. This ensures that the crawler waits for 5000 milliseconds, or 5 seconds, to get a response from every URL it crawls.
Optionally, to add headers, click Add Header. Then, in the Key field, enter the name of the user agent that your content expects. In the Value field, enter the value of the user agent that your content expects. This security measure ensures that only the Search crawler, and not any other crawler, can crawl the data.

For example, enter user-agent as the Key and sitecorebot as the Value.
Optionally, if you want to stop the crawler from accepting navigation cookies, turn off ENABLE NAVIGATION COOKIES DURING CRAWLING in the Additional Settings section.

Navigation cookies track the crawler's path and record the URLs it visits. Sometimes, cookies might mislead the crawler into re-indexing previously visited URLs. However, cookies are important for websites that need them for subsequent access, especially websites that need authentication.
Click Save.
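
As a rough illustration of how MAX DEPTH and MAX URLS bound a crawl, the following sketch walks a small in-memory link graph level by level. The `links` graph and the `crawl` function are hypothetical, and the way depth is counted here (the starting point is level 0) is an assumption. Search's own accounting may differ.

```javascript
// Hypothetical link graph: each key is a URL and each value lists the URLs it leads to.
const links = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1"],
    "/a/1": ["/a/1/x"],
    "/a/2": [],
    "/b/1": [],
    "/a/1/x": [],
};

// Breadth-first crawl that stops at maxDepth levels or maxUrls visited URLs,
// whichever limit is reached first.
function crawl(start, { maxDepth, maxUrls }) {
    const visited = [];
    const seen = new Set([start]);
    let frontier = [start];
    for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
        const next = [];
        for (const url of frontier) {
            if (visited.length >= maxUrls) return visited;
            visited.push(url);
            for (const child of links[url] ?? []) {
                if (!seen.has(child)) {
                    seen.add(child);
                    next.push(child);
                }
            }
        }
        frontier = next;
    }
    return visited;
}

console.log(crawl("/", { maxDepth: 1, maxUrls: 5000 })); // [ '/', '/a', '/b' ]
console.log(crawl("/", { maxDepth: 5, maxUrls: 4 }));    // [ '/', '/a', '/b', '/a/1' ]
```

In this sketch, a small MAX DEPTH cuts off deeper levels even when the URL budget is large, and a small MAX URLS stops the crawl partway through a level.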

Configure a trigger

Configure a trigger to give the API crawler a starting point to look for content to index. You can use the following types of triggers:

To configure a request trigger with a GraphQL API endpoint:

On the Source Settings page, next to Triggers, click Edit.
Click Add Trigger.
In the Trigger Type drop-down list, click Request.

Optionally, in the Body field, enter the body of the request.

Note

If you use a POST or PATCH request, you must enter a request body.

For example, use:

{"query":"query getItem($path: String) {\n  item(language: \"en\", path: $path) {id path children {results {name}\n    }\n  }\n}\n","variables":{"path":"/sitecore/content/mvpsite"}}

Optionally, to configure a header, click Add Header. Then, in the Key field, enter the name of the user agent that your content expects. In the Value field, enter the value of the user agent that your content expects.

For example, enter user-agent as the Key and sitecorebot as the Value.

Note

This security measure ensures that only the Search crawler, and not any other crawler, can crawl your data.
Optionally, in the Method drop-down menu, click the HTTP method you want to use. The default, GET, is selected.

For example, click POST.
In the URL field, enter the API endpoint you want to use as the trigger.

For example, paste https://edge.sitecorecloud.io/api/graphql/v1

Note

You do not see the trigger response in Sitecore Search. However, it is useful to keep it in mind because you later configure a request extractor that uses the response of the trigger.

In this example, the trigger returns the following JSON response:

{
    "data": {
        "item": {
            "id": "xxx",
            "path": "/sitecore/content/MvpSite",
            "children": {
                "results": [
                    {
                        "name": "Home"
                    },
                    {
                        "name": "MVP Repository"
                    },
                    {
                        "name": "Shared Content"
                    },
                    {
                        "name": "Settings"
                    }
                ]
            }
        }
    }
}
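
Because the Body field expects the GraphQL query wrapped in a JSON string, it can be easier to build the body in code than to escape it by hand. The following is a minimal sketch that assumes you prepare the body outside of Search and paste the result into the Body field. `buildTriggerBody` and `childNames` are hypothetical helper names, not part of Sitecore Search.

```javascript
// Readable, multi-line version of the query used in the example request body.
const query = `query getItem($path: String) {
  item(language: "en", path: $path) {id path children {results {name}
    }
  }
}
`;

// Serializes the query and its variables into the JSON string to paste into the Body field.
function buildTriggerBody(path) {
    return JSON.stringify({ query, variables: { path } });
}

// Reads the child item names from a trigger response shaped like the example above.
// Returns an empty array if any level of the response is missing.
function childNames(responseBody) {
    return responseBody?.data?.item?.children?.results?.map((result) => result.name) ?? [];
}

const body = buildTriggerBody("/sitecore/content/mvpsite");
console.log(body);

const exampleResponse = {
    data: { item: { id: "xxx", path: "/sitecore/content/MvpSite", children: { results: [{ name: "Home" }, { name: "Settings" }] } } },
};
console.log(childNames(exampleResponse)); // [ 'Home', 'Settings' ]
```

`JSON.stringify` escapes the quotation marks and line breaks in the query, which avoids the most common cause of an invalid request body.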

Configure a request extractor

A request extractor creates additional URLs for the crawler to crawl.

Request extractors are very important when you configure an API crawler. For an API crawler, triggers return JSON and not URLs. To handle this, configure a request extractor to use the output of the trigger and return URLs or API endpoints for the API crawler to crawl.

In this example, you need to configure a request extractor that uses the JSON objects that the trigger outputs, and generates API endpoints.

To create a request extractor that uses JSON objects as the input and returns a list of API endpoints:

Click Sources and select the source you created.
On the Source Settings page, next to Request Extractors, click Edit.
To create a request extractor, on the Request Extractors page:
- In the Name field, enter a meaningful name for the extractor.
  
  For example, enter Sitecore video URLs.
- Optionally, in the URLs To Match field, select the TYPE of expression you want to use and enter the VALUE of that expression.
  
  For example, to crawl all URLs in this format: <some text>/homeloans/<some text>, select Glob Expression and enter the VALUE as **/homeloans/**.* .

In the JS Source field, paste a JavaScript function that returns a list of URLs.

Note

The function must use Cheerio syntax and must return an array of objects.

For example, paste:

function extract(request, response) {
    // Requests for the crawler to follow. Stays empty if the response has no children.
    let requests = [];
    if (response.body && response.body.data && response.body.data.item && response.body.data.item.children) {
        requests = response.body.data.item.children.results.map((e) => {
            // Build each child path from the path that the current request queried.
            const name = e.name;
            const path = JSON.parse(request.body).variables.path + "/" + name;
            return {
                url: request.url,
                method: 'POST',
                headers: {
                    'content-type': ['application/json']
                },
                body: JSON.stringify({
                    "query": "query getItem($path: String) {item(language: \"en\", path: $path) {id path rendered children {results {name}}}}",
                    "operationName": "getItem",
                    "variables": {
                        "path": path
                    }
                })
            };
        });
    }

    return requests;
}

This function returns a list of request objects similar to the following. The paths in this sample output come from a different content tree than the trigger example:

[
    {
        "url": "https://edge.sitecorecloud.io/api/graphql/v1",
        "method": "POST",
        "headers": {
            "content-type": [
                "application/json"
            ]
        },
        "body": "{\"query\":\"query getItem($path: String) {item(language: \\\"en\\\", path: $path) {id path rendered children {results {name}}}}\",\"operationName\":\"getItem\",\"variables\":{\"path\":\"/sitecore/content/Sugcon/SugconEuSxa/Home\"}}"
    },
    {
        "url": "https://edge.sitecorecloud.io/api/graphql/v1",
        "method": "POST",
        "headers": {
            "content-type": [
                "application/json"
            ]
        },
        "body": "{\"query\":\"query getItem($path: String) {item(language: \\\"en\\\", path: $path) {id path rendered children {results {name}}}}\",\"operationName\":\"getItem\",\"variables\":{\"path\":\"/sitecore/content/Sugcon/SugconEuSxa/Media\"}}"
    },
    {
        "url": "https://edge.sitecorecloud.io/api/graphql/v1",
        "method": "POST",
        "headers": {
            "content-type": [
                "application/json"
            ]
        },
        "body": "{\"query\":\"query getItem($path: String) {item(language: \\\"en\\\", path: $path) {id path rendered children {results {name}}}}\",\"operationName\":\"getItem\",\"variables\":{\"path\":\"/sitecore/content/Sugcon/SugconEuSxa/Data\"}}"
    },
    .....
]

Click Save.
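
Before pasting a request extractor into Search, you can try it locally against a mocked trigger request and response. The sketch below repeats the logic of the example function and runs it on the example trigger response. The shapes of the `request` and `response` objects (a JSON string in `request.body`, a parsed object in `response.body`) are assumptions inferred from that function, not a documented contract.

```javascript
// Same logic as the example request extractor, repeated so this snippet runs on its own.
function extract(request, response) {
    let requests = [];
    if (response.body && response.body.data && response.body.data.item && response.body.data.item.children) {
        requests = response.body.data.item.children.results.map((e) => {
            const path = JSON.parse(request.body).variables.path + "/" + e.name;
            return {
                url: request.url,
                method: "POST",
                headers: { "content-type": ["application/json"] },
                body: JSON.stringify({
                    query: "query getItem($path: String) {item(language: \"en\", path: $path) {id path rendered children {results {name}}}}",
                    operationName: "getItem",
                    variables: { path },
                }),
            };
        });
    }
    return requests;
}

// Mocked trigger request (assumed shape): the body is the JSON string sent to the endpoint.
const triggerRequest = {
    url: "https://edge.sitecorecloud.io/api/graphql/v1",
    body: JSON.stringify({ variables: { path: "/sitecore/content/mvpsite" } }),
};

// Mocked trigger response (assumed shape): the body is the parsed JSON from the trigger example.
const triggerResponse = {
    body: {
        data: {
            item: {
                id: "xxx",
                path: "/sitecore/content/MvpSite",
                children: { results: [{ name: "Home" }, { name: "MVP Repository" }, { name: "Shared Content" }, { name: "Settings" }] },
            },
        },
    },
};

const generated = extract(triggerRequest, triggerResponse);
const paths = generated.map((r) => JSON.parse(r.body).variables.path);
console.log(paths);
```

Each generated request targets the same endpoint as the trigger and queries one child path, so the list of paths is a quick way to confirm that the extractor builds the endpoints you expect.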

Configure a document extractor

Configure a document extractor to specify how to extract attribute values from each content item. The document extractor crawls the URLs or API endpoints that the request extractor generates.

You can use the following types of document extractors:

Note

For an API crawler, you can configure a JavaScript extractor or a JSONPath document extractor.

Keep the following points in mind when you decide which document extractor to use:

If the trigger or request extractor output is JSON, you can use a JSONPath document extractor or a JavaScript document extractor.
If the trigger or request extractor output is XML, use a JavaScript document extractor.

To create a JSONPath document extractor that uses a JavaScript function to match URLs:

On the menu bar, click Sources.
Select the source you created.
On the Source Settings page, next to Document Extractors, click Edit.
To create a document extractor, on the Document Extractors page:
- In the Name field, enter a meaningful name for the extractor.
  
  For example, enter Sitecore cloud.
- In the Extractor Type drop-down menu, click JSONPath.
- Optionally, to ensure that this extractor's logic only applies to URLs that match a certain pattern, configure URLs to Match. To do this, in the URLs To Match field, click Add Matcher, select the TYPE of expression you want to use, and enter the VALUE of that expression.
  
  For example, click JS in the TYPE drop-down menu and enter the following expression to use JavaScript to ensure that the extractor only extracts attributes from API endpoints whose response has a value for body.data.item.rendered.
  function match(request, response) { return response.body.data.item.rendered != null && response.body.data.item.rendered.sitecore.route.placeholders['headless-main'].length > 0; }
In the Taggers section, click Add Tagger. Then, in the tag editor, select a tag in the Tag drop-down menu. For example, select content.

Note

In a document extractor, you can create multiple taggers, each linked to a unique tag. Each tagger:
- Generates a set of index documents.
- Can have multiple rules, where each rule defines the extraction logic of one attribute.

For example, one tagger with five rules yields one set of documents, each with five attributes. Three taggers with one rule each yield three sets of documents, each with one attribute.
In the tag editor, enter the following details to extract an attribute:
- In the Attribute drop-down menu, click the attribute you want to configure.
  
  For example, click Description.
- In the Value type drop-down menu, choose whether you want the attribute value to be a fixed value or an expression.
  
  For example, click Expressions.
- In the EXPRESSION field, enter a JSONPath expression that results in the attribute value.
  
  For example, to get the value of description from the ..placeholders['headless-main']..fields.Description key, enter
  ..placeholders['headless-main']..fields.Description.value
Optionally, to configure more than one way to get an attribute value, click Add Selector. Then, in the Expression field of the second selector, enter the JSONPath expression that results in the attribute value.

For example, as a second option, you want to get the value of description from the ..placeholders['headless-main']..fields.Text key. To do this, enter the following JSONPath expression:
..placeholders['headless-main']..fields.Text.value
Note

When there are multiple selectors, Search runs them in the order in which you added them and stops at the first expression that gives a result.
To configure how to extract other attributes, click Add Rule and repeat steps 5 and 6.

For example, now that you've configured how to extract the description attribute, you can configure how to extract the title, subtitle, and image attributes.
In the tag editor, click Save.
Optionally, to extract attributes for another tag, click Add Tagger, click a tag in the Tag drop-down menu, and repeat steps 5 through 8.
On the Document Extractors page, click Save.
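
The matcher and the selector fallback described above can be tried locally. The following sketch shows a more defensive variant of the JS matcher and approximates the two JSONPath selectors in plain JavaScript. It is not a JSONPath engine and may differ from how Search evaluates expressions. The helper names (`descend`, `selectFieldValues`, `firstSelectorResult`) and the sample response are hypothetical.

```javascript
// Defensive variant of the matcher: optional chaining avoids a TypeError
// when any level of the response is missing.
function match(request, response) {
    const main = response?.body?.data?.item?.rendered?.sitecore?.route?.placeholders?.["headless-main"];
    return Array.isArray(main) && main.length > 0;
}

// Collects the value of every property named `key` at any depth,
// which approximates the JSONPath recursive descent operator (..key).
function descend(node, key) {
    const found = [];
    const visit = (value) => {
        if (value === null || typeof value !== "object") return;
        if (!Array.isArray(value) && Object.prototype.hasOwnProperty.call(value, key)) {
            found.push(value[key]);
        }
        Object.values(value).forEach((child) => visit(child));
    };
    visit(node);
    return found;
}

// Approximates ..placeholders['headless-main']..fields.<fieldName>.value
function selectFieldValues(body, fieldName) {
    return descend(body, "placeholders")
        .map((placeholders) => placeholders?.["headless-main"])
        .flatMap((main) => descend(main, "fields"))
        .map((fields) => fields?.[fieldName]?.value)
        .filter((value) => value !== undefined && value !== null);
}

// Tries each selector in order and stops at the first one that yields a value.
function firstSelectorResult(body, fieldNames) {
    for (const fieldName of fieldNames) {
        const values = selectFieldValues(body, fieldName);
        if (values.length > 0) return values;
    }
    return [];
}

// Hypothetical response in which the component has a Text field but no Description field.
const sampleResponse = {
    body: { data: { item: { rendered: { sitecore: { route: { placeholders: {
        "headless-main": [{ componentName: "RichText", fields: { Text: { value: "Fallback text" } } }],
    } } } } } } },
};

console.log(match({}, sampleResponse)); // true
console.log(firstSelectorResult(sampleResponse.body, ["Description", "Text"])); // [ 'Fallback text' ]
```

In this sample, the first selector (Description) finds nothing, so the second selector (Text) supplies the attribute value.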

Schedule scans

To create a crawler schedule:

On the menu bar, click Sources.
Click the source that you want to schedule crawls for.
On the Source Settings page, click Edit next to Crawler Scheduler.
To configure when you want Search to start scheduled crawls, click a value in the STARTS drop-down menu. If you want the schedule to start as soon as possible, select Anytime. If you want the crawl to start on a particular date, click Specific Date, and click a date in the date picker.
To have the crawl run on a regular schedule, in the REPEAT drop-down menu, click Yes.

Tip

To schedule a crawl to happen just once on a future date, in the REPEAT drop-down menu, click DOES NOT REPEAT. With this configuration, a crawl will happen on the date you selected in Step 4 and will not repeat.
To define the frequency of crawls, click values next to the Repeats every field. You can click any value from 1 to 99 as the interval and either days, weeks, or (for production domains only) hours as the unit of time. For example, if you want a crawl to run every four weeks, click 4 and weeks.
To define the time at which you want crawls to start, click a value in the RUN TIME drop-down menu. For example, if you want the crawl to start at midnight, click 12:00 AM. The time displayed is automatically adjusted to your time zone.
To configure when you want the crawler schedule to end, click a value in the END DATE drop-down menu. If you want the schedule to continue indefinitely, select Never. If you want the crawl to end on a particular date, click Specific Date, and click a date in the date picker.
Click Save.
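
To see what a repeat interval means in calendar terms, the following sketch lists the first few run times for a given start, interval, and unit. It is plain date arithmetic in UTC with fixed-length units, not a description of how Search schedules crawls internally. `nextRuns` is a hypothetical helper.

```javascript
// Length of each supported unit in milliseconds.
const MS_PER_UNIT = {
    hours: 60 * 60 * 1000,
    days: 24 * 60 * 60 * 1000,
    weeks: 7 * 24 * 60 * 60 * 1000,
};

// Returns the first `count` run times, starting at `start` and repeating every `interval` units.
function nextRuns(start, interval, unit, count) {
    if (!Number.isInteger(interval) || interval < 1 || interval > 99) {
        throw new RangeError("interval must be an integer from 1 to 99");
    }
    if (!(unit in MS_PER_UNIT)) {
        throw new RangeError("unit must be hours, days, or weeks");
    }
    const step = interval * MS_PER_UNIT[unit];
    return Array.from({ length: count }, (_, i) => new Date(start.getTime() + i * step));
}

// A crawl that starts on 6 January 2025 at midnight UTC and repeats every four weeks.
const runs = nextRuns(new Date(Date.UTC(2025, 0, 6, 0, 0)), 4, "weeks", 3);
console.log(runs.map((run) => run.toISOString()));
// [ '2025-01-06T00:00:00.000Z', '2025-02-03T00:00:00.000Z', '2025-03-03T00:00:00.000Z' ]
```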

Publish updates to the source

You must publish the source to start the first scan and index.

To publish a source:

On the menu bar, click Sources.
Click the source you want to publish and click Publish.
In the Publish Source dialog, if you want Search to start a recrawl for this source, select the Trigger source recrawl after publishing check box.
Click Publish.
