Best practices to crawl frequent new updates
This topic describes considerations for capturing frequent changes to a site that has previously been crawled and indexed.
Create the crawler
- Ensure that the crawler type is appropriate for the updated items: web crawler or API crawler.
- Configure all the basic settings of the crawler, including name, URL, and type.
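The basic settings above can be sketched as a simple configuration check. The field names (`name`, `url`, `crawler_type`) are illustrative assumptions for this sketch, not a specific product's API:

```python
# A minimal sketch of a crawler definition. The field names here are
# illustrative assumptions, not a particular product's API.
crawler_config = {
    "name": "news-incremental",          # descriptive crawler name
    "url": "https://example.com/news",   # root URL of the frequently updated area
    "crawler_type": "web",               # "web" or "api", matching the source
}

def validate_config(config):
    """Check that the basic required settings are present and consistent."""
    required = {"name", "url", "crawler_type"}
    missing = required - config.keys()
    if missing:
        raise ValueError(f"missing settings: {sorted(missing)}")
    if config["crawler_type"] not in {"web", "api"}:
        raise ValueError("crawler_type must be 'web' or 'api'")
    return True
```

Validating the configuration up front catches a missing URL or an unsupported crawler type before the first scheduled run.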
Configure settings
- Enable incremental updates on the crawler so that it indexes only items added or changed since the last crawl.
- Confirm that the crawler focuses only on areas of the site with dynamic or frequently changing content.
- Include all the URLs for the focused areas, along with their authentication information.
- Confirm that the crawler type can index all item types at the URLs or endpoints.
- Support the crawler with extractors for PDFs, images, or localized content, as applicable.
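The core of an incremental update is selecting only items that changed since the last crawl. A minimal sketch, assuming each item carries a last-modified timestamp (for example, from a sitemap `lastmod` value or an HTTP `Last-Modified` header):

```python
from datetime import datetime, timezone

def items_to_recrawl(items, last_crawl):
    """Return only the items modified after the previous crawl.

    Illustrative sketch: the item shape (a dict with "url" and
    "last_modified") is an assumption, not a product schema.
    """
    return [item for item in items if item["last_modified"] > last_crawl]

last_crawl = datetime(2024, 1, 1, tzinfo=timezone.utc)
items = [
    {"url": "https://example.com/news/a",
     "last_modified": datetime(2024, 1, 2, tzinfo=timezone.utc)},
    {"url": "https://example.com/news/b",
     "last_modified": datetime(2023, 12, 30, tzinfo=timezone.utc)},
]
changed = items_to_recrawl(items, last_crawl)  # only /news/a is newer
```

Persisting the timestamp of each successful crawl is what lets the next run skip unchanged items.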
Assign document extractors
- Confirm that the document extractors can extract values from crawled items, for example, using XPath or JSON.
- Ensure that the document extractors are configured to handle incremental changes.
- Validate that the document extractors accurately extract the desired values.
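A quick way to validate extractors is to run them against a sample item and assert on the result. This sketch shows the two extractor styles mentioned above, using only the standard library (note that `xml.etree` supports only a limited XPath subset):

```python
import json
import xml.etree.ElementTree as ET

# Illustrative extractor sketches: real extractors are configured in the
# crawler; these functions only demonstrate the two selection styles.

def extract_title_xpath(markup):
    """Extract the <title> text with a (limited) XPath expression."""
    root = ET.fromstring(markup)
    node = root.find(".//title")
    return node.text if node is not None else None

def extract_title_json(payload):
    """Extract a title field from a JSON document."""
    return json.loads(payload).get("title")

sample_html = "<html><head><title>Release notes</title></head><body/></html>"
sample_json = '{"title": "Release notes", "updated": "2024-01-02"}'
```

Running both extractors on known samples, and on samples that are missing the field, verifies that they return the desired values and fail gracefully.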
Configure triggers
- Implement request or JavaScript triggers to initiate incremental updates.
- Schedule the trigger to run at regular intervals, such as hourly or every few hours, matching the frequency of the changes.
- Set the trigger to run at a low-traffic time to minimize the impact on site performance.
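The scheduling rules above can be combined into a single check: trigger when the chosen interval has elapsed and the current time falls inside a low-traffic window. The window boundaries and interval below are illustrative assumptions:

```python
from datetime import time

# Illustrative low-traffic window; pick times that match your site's traffic.
LOW_TRAFFIC_START = time(1, 0)   # 01:00
LOW_TRAFFIC_END = time(5, 0)     # 05:00

def should_trigger(now_time, minutes_since_last_run, interval_minutes=60):
    """Trigger when the interval has elapsed and we are in the quiet window."""
    in_window = LOW_TRAFFIC_START <= now_time <= LOW_TRAFFIC_END
    return in_window and minutes_since_last_run >= interval_minutes
```

Tuning `interval_minutes` to the site's actual change frequency avoids wasted crawls, while the window check keeps the load off peak hours.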
Add tags
- Add tags to your crawler to support your categorization and prioritization needs.