Configuring request extractors
A request extractor creates an additional list of URLs for the crawler to crawl. If the crawler does not reach all the content you need to index by following links from its original starting point (the trigger), use a request extractor. The request extractor is a JavaScript function that takes the trigger output as its input.
For example, suppose you have a sitemap that contains all the HTML pages in your source content, and you use the sitemap URL as the trigger. Some of those pages also embed PDF content, but the sitemap does not contain the URLs for the PDFs. In this case, configure a request extractor to generate URLs for the PDF content only.
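The PDF scenario can be sketched roughly as follows. This is a minimal illustration, not the crawler's actual extractor contract: the function name `extractPdfUrls`, its `(pageUrl, html)` parameters, and the `{ url }` return shape are all assumptions made for the example. It scans a page's HTML for `href` or `src` attributes that point to `.pdf` files and returns them as absolute URLs.

```javascript
// Hypothetical sketch: the name, parameters, and return shape are assumptions,
// not the crawler's documented extractor signature.
function extractPdfUrls(pageUrl, html) {
  // Match href="..." or src="..." values that end in .pdf (optionally followed by a query string).
  const pdfAttribute = /(?:href|src)\s*=\s*["']([^"']+\.pdf(?:\?[^"']*)?)["']/gi;
  const seen = new Set();
  const requests = [];

  for (const match of html.matchAll(pdfAttribute)) {
    // Resolve relative links against the URL of the page they were found on.
    const absolute = new URL(match[1], pageUrl).href;
    if (!seen.has(absolute)) {
      seen.add(absolute); // skip duplicates so each PDF is queued once
      requests.push({ url: absolute });
    }
  }
  return requests;
}
```

Because only `.pdf` links are returned, the HTML pages already listed in the sitemap are not queued a second time.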
Request extractors are especially important when you configure an API crawler. For an API crawler, the trigger returns JSON rather than URLs. To handle this, configure a request extractor that takes the output of the trigger and returns URLs or API endpoints for the API crawler to crawl.
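As a rough sketch of the API case, the function below turns a JSON trigger response into a list of endpoints. Everything specific here is an assumption for illustration: the function name `extractApiRequests`, the `items` array and `id` field in the payload, the `https://api.example.com/products/` base URL, and the `{ url }` return shape. Check the signature your crawler expects before adapting it.

```javascript
// Hypothetical sketch: payload fields, base URL, and return shape are assumptions.
function extractApiRequests(triggerBody) {
  // Accept either a raw JSON string or an already-parsed object.
  const payload = typeof triggerBody === "string" ? JSON.parse(triggerBody) : triggerBody;
  const items = Array.isArray(payload.items) ? payload.items : [];

  return items
    // Ignore entries without an id, since no endpoint can be built for them.
    .filter((item) => item && item.id !== undefined && item.id !== null)
    // Build one endpoint per item, escaping the id for use in a URL path.
    .map((item) => ({
      url: `https://api.example.com/products/${encodeURIComponent(item.id)}`,
    }));
}
```

Returning an empty list when the payload has no `items` array keeps the crawler from failing on an unexpected response; whether that is the right behavior depends on how strictly you want to validate the trigger output.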
To configure a request extractor, you add a JavaScript (JS) function that returns a list of URLs or API endpoints to crawl.
Configure the following settings to define request extractors for a crawler:
Setting | Description |
---|---|
Name | A meaningful name for the request extractor. |
URLs to Match | This is an optional setting. |
JS Source | JavaScript function that generates URLs or API endpoints for the crawler to crawl. |