Configure an API crawler

The Sitecore Search API crawler is a powerful crawler specifically designed to handle JSON content. It supports complex use cases such as requiring authentication to access your source content, creating index documents in multiple languages, using JavaScript to extract attribute values, and more.

The API crawler works by accessing URLs or API endpoints and indexing the content in each URL or endpoint.
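For example, one of your endpoints might return JSON like the following (a hypothetical endpoint payload, shown only for illustration). The crawler reads this response, and the extractors you configure later in this walkthrough turn its fields into index document attributes:

RequestResponse
{
    "id": "12345",
    "title": "Getting started",
    "url": "https://www.example.com/articles/getting-started",
    "description": "A short introduction to the product."
}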

Important

Refer to detailed specifications of crawler types when choosing an indexing method.

Follow these best practices to successfully crawl complete websites or crawl updates.

This walkthrough describes how to:

  • Create a source.

  • Configure crawler settings.

  • Configure a trigger.

  • Configure a request extractor.

  • Configure a document extractor.

  • Schedule scans and publish the source.

Note

If your original content requires authentication before the crawler can access it, configure crawler authentication settings. For example, your original content might need a GUI-based user name and password, or an access token or key in the request header. You can do this at any point after source creation.
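For example, a source protected by token-based authentication might expect a standard HTTP request header like the following. The header name and token are placeholders, not values used elsewhere in this walkthrough; you enter the real values in the crawler's authentication settings:

RequestResponse
Authorization: Bearer <your-access-token>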

Create a source

To create a source:

  1. In the CONNECTOR drop-down list, click API Crawler.

Configure crawler settings

Configure crawler settings to define the high-level options that determine the scope of the API crawler.

To configure API crawler settings:

  1. Click Sources and then click the source you created.

  2. On the Source Settings page, next to API Crawler Settings, click Edit.

  3. Click Save.

Configure a trigger

Configure triggers to give the API crawler a starting point to look for content to index. You can use several types of triggers; this walkthrough uses a request trigger.

To configure a request trigger with a GraphQL API endpoint:

  1. On the Source Settings page, next to Triggers, click Edit.

  2. Click Add Trigger.

  3. In the Trigger Type drop-down field, click Request.

  4. Optionally, in the Body field, enter the body of the request.

    Note

    If you use a POST or PATCH request, you must enter a request body.

    For example, use:

    RequestResponse
    {"query":"query getItem($path: String) {\n  item(language: \"en\", path: $path) {id path children {results {name}\n    }\n  }\n}\n","variables":{"path":"/sitecore/content/mvpsite"}}
  5. Optionally, to configure a header, click Add Header. Then, in the Key field, enter the name of the header that your content expects and, in the Value field, enter its value. This security measure ensures that only the Search crawler, and not any other crawler, can crawl your data.

    For example, enter user-agent as the Key and sitecorebot as the Value.

  6. Optionally, in the Method drop-down menu, click the HTTP method you want to use. By default, GET is selected.

    For example, click POST.

  7. In the URL field, paste the API endpoint you want to use as the trigger.

    For example, paste https://edge.sitecorecloud.io/api/graphql/v1
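Together, these steps define a request equivalent to the following sketch. The body is the JSON from step 4, shown here as a pretty-printed object for readability; in the Body field itself it remains a single JSON string:

RequestResponse
{
    "url": "https://edge.sitecorecloud.io/api/graphql/v1",
    "method": "POST",
    "headers": {
        "user-agent": ["sitecorebot"]
    },
    "body": {
        "query": "query getItem($path: String) {\n  item(language: \"en\", path: $path) {id path children {results {name}\n    }\n  }\n}\n",
        "variables": {
            "path": "/sitecore/content/mvpsite"
        }
    }
}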

Note

You do not see the trigger response in Sitecore Search. However, it is useful to keep the response in mind because you later configure a request extractor that uses it.

In this example, the trigger returns the following JSON response:

RequestResponse
{
    "data": {
        "item": {
            "id": "xxx",
            "path": "/sitecore/content/MvpSite",
            "children": {
                "results": [
                    {
                        "name": "Home"
                    },
                    {
                        "name": "MVP Repository"
                    },
                    {
                        "name": "Shared Content"
                    },
                    {
                        "name": "Settings"
                    }
                ]
            }
        }
    }
}

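In this response, the values that the request extractor (configured next) needs are the child item names. Expressed as a JSONPath expression, shown here only for orientation, they are located at the following path; the request extractor in this walkthrough navigates to them with plain JavaScript instead:

RequestResponse
$.data.item.children.results[*].name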
Configure a request extractor

A request extractor creates additional URLs for the crawler to crawl.

Request extractors are essential when you configure an API crawler because triggers return JSON, not URLs. To handle this, configure a request extractor that takes the output of the trigger and returns URLs or API endpoints for the API crawler to crawl.

In this example, you configure a request extractor that takes the JSON objects that the trigger outputs and generates API endpoints.

To create a request extractor that uses JSON objects as the input and returns a list of API endpoints:

  1. Click Sources and select the source you created.

  2. On the Source Settings page, next to Request Extractors, click Edit.

  3. To create a request extractor, on the Request Extractors page:

    • In the Name field, enter a meaningful name for the extractor.

      For example, enter Sitecore video URLs.

    • Optionally, in the URLs To Match field, select the TYPE of expression you want to use and enter the VALUE of that expression.

      For example, to crawl all URLs in this format: <some text>/homeloans/<some text>, select Glob Expression and enter the VALUE as **/homeloans/**.* .

  4. In the JS Source field, paste a JavaScript function that returns a list of URLs.

    Note

    The function must use Cheerio syntax and must return an array of objects.

    For example, paste:

    RequestResponse
    function extract(request, response) {
        // Collect one new request per child item in the trigger response.
        let requests = [];
        if (response.body && response.body.data && response.body.data.item && response.body.data.item.children) {
            requests = response.body.data.item.children.results.map((e) => {
                // Append the child item name to the path used in the current request.
                const name = e.name;
                const path = JSON.parse(request.body).variables.path + "/" + name;
                // Return a request object for the crawler to call next.
                return {
                    url: request.url,
                    method: 'POST',
                    headers: {
                        'content-type': ['application/json']
                    },
                    body: JSON.stringify({
                        "query": "query getItem($path: String) {item(language: \"en\", path: $path) {id path rendered children {results {name}}}}",
                        "operationName": "getItem",
                        "variables": {
                            "path": path
                        }
                    })
                };
            });
        }

        return requests;
    }
    

    This function returns the following API endpoints:

    RequestResponse
    [
        {
            "url": "https://edge.sitecorecloud.io/api/graphql/v1",
            "method": "POST",
            "headers": {
                "content-type": [
                    "application/json"
                ]
            },
            "body": "{"query":"query getItem($path: String) {item(language: \\"en\\", path: $path) {id path rendered children {results {name}}}}","operationName":"getItem","variables":{"path":"/sitecore/content/Sugcon/SugconEuSxa/Home"}}"
        },
        {
            "url": "https://edge.sitecorecloud.io/api/graphql/v1",
            "method": "POST",
            "headers": {
                "content-type": [
                    "application/json"
                ]
            },
            "body": "{"query":"query getItem($path: String) {item(language: \\"en\\", path: $path) {id path rendered children {results {name}}}}","operationName":"getItem","variables":{"path":"/sitecore/content/Sugcon/SugconEuSxa/Media"}}"
        },
        {
            "url": "https://edge.sitecorecloud.io/api/graphql/v1",
            "method": "POST",
            "headers": {
                "content-type": [
                    "application/json"
                ]
            },
            "body": "{"query":"query getItem($path: String) {item(language: \\"en\\", path: $path) {id path rendered children {results {name}}}}","operationName":"getItem","variables":{"path":"/sitecore/content/Sugcon/SugconEuSxa/Data"}}"
        },
        .....
    ]
  5. Click Save.

Configure a document extractor

Configure a document extractor to specify how to extract attribute values from each content item. The crawler applies the document extractor to each URL or API endpoint that the request extractor generates.

You can use several types of document extractors.

Note

For an API crawler, you can configure a JavaScript extractor or a JSONPath document extractor.

Keep the following points in mind when you decide which document extractor to use:

  • If the trigger or request extractor output is JSON, you can use a JSONPath document extractor or a JavaScript document extractor.

  • If the trigger or request extractor output is XML, use a JavaScript document extractor.
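For example, a minimal JavaScript document extractor for the GraphQL responses crawled in this walkthrough might look like the following sketch. The attribute names id, name, and url are assumptions for illustration; use the attribute names configured in your entity:

RequestResponse
function extract(request, response) {
    // The GraphQL item returned for the crawled endpoint.
    const item = response.body && response.body.data && response.body.data.item;
    if (!item) {
        return [];
    }
    // Return one index document per response, with its attribute values.
    return [{
        'id': item.id,
        'name': item.path.split('/').pop(),
        'url': item.path
    }];
}

Alternatively, with a JSONPath document extractor, you can map the same attributes with expressions such as $.data.item.id and $.data.item.path.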

Schedule scans

Optionally, schedule scans to define how often the crawler recrawls your content and updates the index.

Publish updates to the source

You must publish the source to start the first scan and index your content.
