Configure an API crawler

The Sitecore Search API crawler is designed specifically to handle JSON content. It supports complex use cases, such as source content that requires authentication, index documents in multiple languages, attribute values extracted with JavaScript, and more.

The API crawler works by accessing URLs or API endpoints and indexing the content in each URL or endpoint.

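This walkthrough describes how to:

  • Create an API crawler source

  • Configure API crawler settings

  • Configure a request trigger

  • Configure a request extractor that returns API endpoints

  • Configure a JSONPath document extractor and use JavaScript to define URLs to match

  • Schedule scans

  • Publish the source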

Note

If your original content requires authentication before the crawler can access it, configure crawler authentication settings. For example, your original content might need a GUI-based user name and password, or an access token or key in the request header. You can do this at any point after source creation.

Create an API crawler source

  1. Here, click API Crawler.

Configure API crawler settings

Configure crawler settings to define the scope of the API crawler.

To configure the scope of the API crawler:

  1. Click Sources and then click the source you created.

  2. On the Source Settings page, click Edit next to API Crawler Settings.

  3. Click Save.

Configure a request trigger

Configure triggers to give the API crawler a starting point to look for content to index.

Note

For an API crawler, you can configure a request trigger or a JavaScript trigger.

In this example, you use a request trigger with a GraphQL API endpoint.

To configure a request trigger:

  1. On the Source Settings page, click Edit next to Triggers.

  2. Click Add Trigger.

  3. In the Trigger Type drop-down field, click Request.

  4. Optionally, in the Body field, enter the body of the request.

    Note

    If you use a POST or PATCH request, you have to enter a request body.

    For example, use:

    {"query":"query getItem($path: String) {\n  item(language: \"en\", path: $path) {id path children {results {name}\n    }\n  }\n}\n","variables":{"path":"/sitecore/content/mvpsite"}}
    
  5. Optionally, to configure a header, click Add Header. Then, in the Key field, enter the name of the user agent that your content expects. In the Value field, enter the value of the user agent that your content expects. This security measure ensures that only the Search crawler, and not any other crawler, can crawl your data.

    For example, enter user-agent as the Key and sitecorebot as the Value.

  6. Optionally, in the Method drop-down menu, click the HTTP method you want to use. By default, GET is selected.

    For example, select POST.

  7. In the URL field, paste the API endpoint you want to use as the trigger.

    For example, paste https://edge.sitecorecloud.io/api/graphql/v1
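The trigger settings in the steps above correspond to an ordinary HTTP request. As a minimal sketch, assuming you want to inspect that request outside Sitecore Search, the following JavaScript assembles it from the example values in this walkthrough. The function name buildTriggerRequest and the variable triggerRequest are hypothetical and are not part of Sitecore Search.

```javascript
// Hypothetical helper: mirrors the example values entered in the trigger form.
function buildTriggerRequest() {
    // Same GraphQL query as the example request body in step 4.
    const query = "query getItem($path: String) {\n  item(language: \"en\", path: $path) {id path children {results {name}\n    }\n  }\n}\n";
    return {
        url: "https://edge.sitecorecloud.io/api/graphql/v1", // step 7
        method: "POST",                                       // step 6
        headers: { "user-agent": "sitecorebot" },             // step 5
        body: JSON.stringify({
            query: query,
            variables: { path: "/sitecore/content/mvpsite" }
        })
    };
}

const triggerRequest = buildTriggerRequest();
console.log(triggerRequest.method, triggerRequest.url);
console.log(JSON.parse(triggerRequest.body).variables.path);
```

You could pass these fields to an HTTP client of your choice to preview the response. A real endpoint may also expect credentials or further headers, which are not shown here.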

Note

You do not see the trigger response in Sitecore Search. However, it is useful to keep it in mind because you later configure a request extractor that uses the response of the trigger.

In this example, the trigger returns the following JSON response:

{
    "data": {
        "item": {
            "id": "xxx",
            "path": "/sitecore/content/MvpSite",
            "children": {
                "results": [
                    {
                        "name": "Home"
                    },
                    {
                        "name": "MVP Repository"
                    },
                    {
                        "name": "Shared Content"
                    },
                    {
                        "name": "Settings"
                    }
                ]
            }
        }
    }
}
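
Because the request extractor you configure later reads this response, it can help to see which part of it matters. The following is a minimal sketch, assuming the response shape shown above. The helper name childNames and the variable triggerResponse are hypothetical.

```javascript
// Sample trigger response, copied from the example above ("xxx" is the placeholder ID).
const triggerResponse = {
    data: {
        item: {
            id: "xxx",
            path: "/sitecore/content/MvpSite",
            children: {
                results: [
                    { name: "Home" },
                    { name: "MVP Repository" },
                    { name: "Shared Content" },
                    { name: "Settings" }
                ]
            }
        }
    }
};

// Hypothetical helper: returns the child item names, or an empty array
// when the response does not have the expected shape.
function childNames(body) {
    const hasChildren = body && body.data && body.data.item && body.data.item.children;
    const results = hasChildren ? body.data.item.children.results : [];
    return Array.isArray(results) ? results.map((child) => child.name) : [];
}

console.log(childNames(triggerResponse));
```

Each name in the returned array is what a request extractor can append to the parent path to build one follow-up request per child item.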

Configure a request extractor that returns API endpoints

A request extractor creates additional URLs for the crawler to crawl.

Request extractors are very important when you configure an API crawler. For an API crawler, triggers return JSON and not URLs. To handle this, configure a request extractor to use the output of the trigger and return URLs or API endpoints for the API crawler to crawl.

In this example, you need to configure a request extractor that uses the JSON objects that the trigger outputs, and generates API endpoints.

To create a request extractor that uses JSON objects as the input and returns a list of API endpoints:

  1. Click Sources and select the source you created.

  2. On the Source Settings page, next to Request Extractors, click Edit.

  3. To create a request extractor, on the Document Extractors page:

    • In the Name field, enter a meaningful name for the extractor.

      For example, enter Sitecore video URLs.

    • Optionally, in the URLs To Match field, select the TYPE of expression you want to use and enter the VALUE of that expression.

      For example, to crawl all URLs in this format: <some text>/homeloans/<some text>, select Glob Expression and enter **/homeloans/**.* as the VALUE.

  4. In the JS Source field, paste a JavaScript function that returns a list of URLs.

    Note

    The function must use Cheerio syntax and must return an array of objects.

    For example, paste:

    function extract(request, response) {
        // Requests for the crawler to send next; stays empty if the response has no children.
        let requests = [];
        if (response.body && response.body.data && response.body.data.item && response.body.data.item.children) {
            requests = response.body.data.item.children.results.map((e, i) => {
                // Build the child path from the parent path sent in the trigger request.
                const name = e.name;
                const path = JSON.parse(request.body).variables.path + "/" + name;
                return {
                    url: request.url,
                    method: 'POST',
                    headers: {
                        'content-type': ['application/json']
                    },
                    body: JSON.stringify({
                        "query": "query getItem($path: String) {item(language: \"en\", path: $path) {id path rendered children {results {name}}}}",
                        "operationName": "getItem",
                        "variables": {
                            "path": path
                        }
                    })
                };
            });
        }

        return requests;
    }
    
    

    This function returns the following API endpoints:

    [
        {
            "url": "https://edge.sitecorecloud.io/api/graphql/v1",
            "method": "POST",
            "headers": {
                "content-type": [
                    "application/json"
                ]
            },
            "body": "{\"query\":\"query getItem($path: String) {item(language: \\\"en\\\", path: $path) {id path rendered children {results {name}}}}\",\"operationName\":\"getItem\",\"variables\":{\"path\":\"/sitecore/content/Sugcon/SugconEuSxa/Home\"}}"
        },
        {
            "url": "https://edge.sitecorecloud.io/api/graphql/v1",
            "method": "POST",
            "headers": {
                "content-type": [
                    "application/json"
                ]
            },
            "body": "{\"query\":\"query getItem($path: String) {item(language: \\\"en\\\", path: $path) {id path rendered children {results {name}}}}\",\"operationName\":\"getItem\",\"variables\":{\"path\":\"/sitecore/content/Sugcon/SugconEuSxa/Media\"}}"
        },
        {
            "url": "https://edge.sitecorecloud.io/api/graphql/v1",
            "method": "POST",
            "headers": {
                "content-type": [
                    "application/json"
                ]
            },
            "body": "{\"query\":\"query getItem($path: String) {item(language: \\\"en\\\", path: $path) {id path rendered children {results {name}}}}\",\"operationName\":\"getItem\",\"variables\":{\"path\":\"/sitecore/content/Sugcon/SugconEuSxa/Data\"}}"
        },
        .....
    ]
    
  5. Click Save.
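
Before you paste a function into the JS Source field, you may want to run it locally. The following is a minimal sketch, assuming Node.js and assuming that the crawler passes request.body as a string and response.body as parsed JSON, which is what the example function implies. The mock objects and the variables mockRequest, mockResponse, generated, and generatedPaths are hypothetical. The extract function repeats the logic of the example above in condensed form.

```javascript
// Same logic as the example request extractor, condensed.
function extract(request, response) {
    let requests = [];
    const item = response.body && response.body.data && response.body.data.item;
    if (item && item.children) {
        // Parent path comes from the variables of the request that produced this response.
        const parentPath = JSON.parse(request.body).variables.path;
        requests = item.children.results.map((e) => ({
            url: request.url,
            method: 'POST',
            headers: { 'content-type': ['application/json'] },
            body: JSON.stringify({
                query: "query getItem($path: String) {item(language: \"en\", path: $path) {id path rendered children {results {name}}}}",
                operationName: "getItem",
                variables: { path: parentPath + "/" + e.name }
            })
        }));
    }
    return requests;
}

// Hypothetical mocks that imitate the trigger request and a shortened trigger response.
const mockRequest = {
    url: "https://edge.sitecorecloud.io/api/graphql/v1",
    body: JSON.stringify({ variables: { path: "/sitecore/content/mvpsite" } })
};
const mockResponse = {
    body: { data: { item: { children: { results: [{ name: "Home" }, { name: "Settings" }] } } } }
};

const generated = extract(mockRequest, mockResponse);
const generatedPaths = generated.map((r) => JSON.parse(r.body).variables.path);
console.log(generatedPaths);
```

Running the function against a response without children, for example { body: {} }, should return an empty array, which is a quick way to confirm that the guard condition works.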

Configure a JSONPath document extractor and use JavaScript to define URLs to match

Configure a document extractor to specify how to extract attribute values from each content item. The document extractor crawls the URLs or API endpoints that the request extractor generates.

Note

For an API crawler, you can configure a JavaScript extractor or a JSONPath document extractor.

Keep the following points in mind when you decide which document extractor to use:

  • If the trigger or request extractor output is JSON, you can use a JSONPath document extractor or a JavaScript document extractor.

  • If the trigger or request extractor output is XML, use a JavaScript document extractor.

In this example, you use a JSONPath document extractor because the request extractor returns API endpoints, and the endpoints return a JSON response.
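
To illustrate what a JSONPath document extractor does conceptually, the sketch below resolves a few JSONPath-style expressions against a sample endpoint response. It is a simplified stand-in that supports only dotted child paths such as $.data.item.id. It is not the full JSONPath syntax and not the Sitecore Search implementation. The attribute names (id, url), the expressions, the sample response values, and the helper resolvePath are all hypothetical.

```javascript
// Hypothetical response from one endpoint generated by the request extractor.
const endpointResponse = {
    data: { item: { id: "xxx", path: "/sitecore/content/mvpsite/Home" } }
};

// Simplified resolver: handles only "$.a.b.c" style expressions.
// Returns undefined when any segment of the path is missing.
function resolvePath(doc, expression) {
    if (!expression.startsWith("$.")) {
        throw new Error("This sketch supports only dotted paths that start with $.");
    }
    return expression.slice(2).split(".").reduce(
        (node, key) => (node !== null && typeof node === "object" ? node[key] : undefined),
        doc
    );
}

// Hypothetical mapping of index attributes to JSONPath-style expressions.
const attributeExpressions = {
    id: "$.data.item.id",
    url: "$.data.item.path"
};

// Build one index document by evaluating each expression against the response.
const indexDocument = {};
for (const [attribute, expression] of Object.entries(attributeExpressions)) {
    indexDocument[attribute] = resolvePath(endpointResponse, expression);
}
console.log(indexDocument);
```

The point of the sketch is the mapping: each attribute in the index document is paired with one expression that selects a value from the JSON response.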

Schedule scans

Publish the source

You must publish the source for Search to start the first scan and index.
