Best practices for crawling frequent new updates

This topic describes considerations when you want to capture frequent new changes to a site that has been previously crawled and indexed.

Create the crawler

  • Ensure that the crawler type is appropriate for the updated items: web crawler or API crawler.

  • Configure all the basic settings of the crawler including name, URL, and type.
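The basic settings above can be sketched as follows. This is a minimal, product-agnostic illustration; the field names (`name`, `url`, `type`) and the set of crawler types are assumptions, not a specific product's API.

```python
# Hypothetical crawler setup sketch; field names and crawler types are
# assumptions for illustration, not a specific product's API.
CRAWLER_TYPES = {"web", "api"}

def create_crawler_config(name, start_url, crawler_type):
    """Build a basic crawler configuration, validating the required settings."""
    if crawler_type not in CRAWLER_TYPES:
        raise ValueError(f"Unsupported crawler type: {crawler_type}")
    if not start_url.startswith(("http://", "https://")):
        raise ValueError(f"Invalid start URL: {start_url}")
    return {"name": name, "url": start_url, "type": crawler_type}

# Choose "web" for site pages, "api" for endpoints that return structured data.
config = create_crawler_config("news-updates", "https://example.com/news", "web")
```

Validating the type and URL up front catches misconfiguration before the first crawl runs.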

Configure settings

  • Enable incremental updates so that the crawler indexes only items added or changed since the last crawl.

  • Confirm that the crawler focuses only on areas of the site with dynamic or frequently changing content.

  • Include all the URLs for the focused areas and their authentication information.

  • Confirm that the crawler type can index all item types at the target URLs or endpoints.

  • Support the crawler with extractors for PDFs, images, or localized content, as applicable.
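The incremental behavior described above can be sketched as a filter over crawled items: only items modified after the previous crawl are re-indexed. The item fields and the stored last-crawl timestamp are illustrative assumptions.

```python
# Sketch of incremental selection: re-index only items added or changed
# since the last crawl. Item structure is an assumption for illustration.
from datetime import datetime, timezone

def items_to_index(items, last_crawl_time):
    """Return only items modified after the previous crawl."""
    return [item for item in items if item["modified"] > last_crawl_time]

last_crawl = datetime(2024, 1, 1, tzinfo=timezone.utc)
items = [
    {"url": "https://example.com/a",
     "modified": datetime(2023, 12, 30, tzinfo=timezone.utc)},  # unchanged
    {"url": "https://example.com/b",
     "modified": datetime(2024, 1, 2, tzinfo=timezone.utc)},    # updated
]
changed = items_to_index(items, last_crawl)
```

Keeping the comparison against a persisted timestamp is what limits each run to the dynamic areas of the site rather than a full re-crawl.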

Assign document extractors

  • Confirm that the document extractors can extract values from crawled items, for example, by using XPath expressions or JSON paths.

  • Ensure the document extractors are configured to handle incremental changes.

  • Verify that the document extractors accurately extract the desired values.
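One way to verify extractors is to run them against sample crawled items and check that both the XPath and JSON paths return the expected value. The sketch below uses only the Python standard library (which supports a limited XPath subset); the sample documents and field paths are assumptions.

```python
# Sketch of extractor validation against sample crawled items.
# The XPath subset supported by the standard library handles ".//title";
# JSON values are read by a simple key path. Samples are illustrative.
import json
import xml.etree.ElementTree as ET

def extract_xml_title(xml_text):
    """Extract a title element from an XML document via an XPath subset."""
    root = ET.fromstring(xml_text)
    return root.findtext(".//title")

def extract_json_value(json_text, *keys):
    """Walk a key path into a JSON document."""
    value = json.loads(json_text)
    for key in keys:
        value = value[key]
    return value

xml_sample = "<page><head><title>Release notes</title></head></page>"
json_sample = '{"item": {"title": "Release notes"}}'
```

Running both extractors on representative samples before enabling the crawler confirms they agree on the desired values, including for incrementally changed items.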

Configure triggers

  • Implement request or JavaScript triggers to initiate incremental updates.

  • Schedule the trigger to run at regular intervals, such as hourly or every few hours, matching the frequency of changes to the site.

  • Set the trigger for a low-traffic time to minimize impact on site performance.
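The scheduling guidance above can be sketched as two helpers: frequent incremental runs at the top of each hour, plus heavier runs shifted into a low-traffic window. The window boundary (02:00 here) is an assumption; pick one that matches your site's traffic profile.

```python
# Sketch of trigger scheduling: hourly incremental runs, plus a heavier
# run placed in an assumed low-traffic window (02:00) to minimize impact
# on site performance.
from datetime import datetime, timedelta

def next_hourly_run(now):
    """Next incremental run: the top of the next hour."""
    return now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)

def next_low_traffic_run(now, window_start_hour=2):
    """Next heavier run, scheduled inside the low-traffic window."""
    candidate = now.replace(hour=window_start_hour, minute=0,
                            second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)  # window already passed today
    return candidate

now = datetime(2024, 1, 1, 14, 30)
```

Matching the hourly cadence to how often content actually changes avoids wasted crawls, while the low-traffic window absorbs the heavier passes.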

Add tags

  • Add tags to your crawler to support your categorization and prioritization needs.
