Best practices for crawling complete websites
This topic describes what to consider when preparing to index an entire website and capture content updates. We recommend running a complete crawl once a week.
Create the crawler
- Ensure that the crawler type is appropriate for all searchable items: web crawler or API crawler.
- Configure all the basic settings of the crawler, including name, URL, and type.
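As a minimal sketch of the settings above, the basic definition can be checked before the crawler is created. The field names and validation rules here are illustrative assumptions, not taken from any specific product API:

```python
# Hypothetical crawler definition; field names are illustrative only.
crawler = {
    "name": "docs-site-crawler",
    "type": "web",  # "web" for HTML pages, "api" for JSON endpoints
    "url": "https://docs.example.com",
}

def validate_crawler(cfg: dict) -> list[str]:
    """Return a list of problems with the basic crawler settings."""
    problems = []
    if not cfg.get("name"):
        problems.append("missing name")
    if cfg.get("type") not in ("web", "api"):
        problems.append("type must be 'web' or 'api'")
    if not cfg.get("url", "").startswith(("http://", "https://")):
        problems.append("url must be an absolute http(s) URL")
    return problems

print(validate_crawler(crawler))  # → []
```

A check like this catches an empty name or a relative URL before the first scheduled run fails.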
Configure settings
- Include all the URLs to be crawled and their authentication information so that all items are crawled and indexed.
- Confirm that the crawler type can index all item types at the URLs or endpoints.
- Support the crawler with extractors for PDFs, images, or localized content, as applicable.
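The URL-and-authentication pairing above can be sketched as a small audit: every start URL under a protected section of the site should have credentials configured, or it will be silently skipped. The setting names, URLs, and `secret_ref` field are invented for illustration:

```python
# Illustrative settings sketch: each start URL paired with the
# authentication it needs, so no section of the site is skipped.
settings = {
    "start_urls": [
        {"url": "https://docs.example.com/public", "auth": None},
        {"url": "https://docs.example.com/internal",
         "auth": {"type": "basic", "secret_ref": "internal-creds"}},
        {"url": "https://docs.example.com/partners", "auth": None},
    ],
}

def urls_missing_auth(settings: dict, protected_prefixes: list[str]) -> list[str]:
    """Flag start URLs under a protected prefix with no auth configured."""
    return [
        entry["url"]
        for entry in settings["start_urls"]
        if entry["auth"] is None
        and any(entry["url"].startswith(p) for p in protected_prefixes)
    ]

print(urls_missing_auth(settings, ["https://docs.example.com/internal",
                                   "https://docs.example.com/partners"]))
# → ['https://docs.example.com/partners']
```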
Assign document extractors
- Confirm that the document extractors can extract values from crawled items, for example, using XPath or JSON expressions.
- Validate that the document extractors accurately extract the desired values.
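The two extractor styles mentioned above can be validated against a sample item before a full crawl. This sketch uses only the Python standard library (which supports a limited XPath subset); the sample markup and field names are invented:

```python
import json
import xml.etree.ElementTree as ET

# An XPath-style extractor for HTML/XML content and a key-path
# extractor for a JSON API response, checked against sample items.
html = "<html><head><title>Install guide</title></head><body/></html>"
title = ET.fromstring(html).findtext(".//title")  # limited XPath subset

api_item = json.loads('{"document": {"name": "Install guide", "lang": "en"}}')
name = api_item["document"]["name"]

# Both extractors should yield the same desired value for this item.
assert title == name == "Install guide"
```

Running extractors against a handful of representative pages and API responses like this is a quick way to verify the expressions before scheduling the crawl.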
Configure triggers
- Include a sitemap and set up a trigger for the crawl.
- Schedule the crawler to run once a week.
- Set the trigger for a low-traffic time to minimize impact on site performance.
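The weekly, low-traffic trigger above amounts to computing the next occurrence of a quiet weekly slot. Sunday 02:00 is an assumed quiet window here; pick the slot that matches your own traffic data:

```python
from datetime import datetime, timedelta

def next_weekly_run(now: datetime, weekday: int = 6, hour: int = 2) -> datetime:
    """Next run at the given weekday/hour (Monday=0 ... Sunday=6)."""
    run = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    days_ahead = (weekday - now.weekday()) % 7
    run += timedelta(days=days_ahead)
    if run <= now:  # this week's slot already passed
        run += timedelta(days=7)
    return run

# From a Wednesday noon, the next Sunday 02:00 slot:
print(next_weekly_run(datetime(2024, 5, 1, 12, 0)))  # → 2024-05-05 02:00:00
```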
Add tags
- Add tags to your crawler to support categorization and prioritization.
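Tags can then be used to group crawlers and decide which to run first. A minimal sketch, with invented tag and crawler names:

```python
# Illustrative tagging sketch: tags group crawlers for filtering
# and prioritized runs. All names here are example assumptions.
crawlers = [
    {"name": "docs-site-crawler", "tags": {"production", "weekly"}},
    {"name": "blog-crawler", "tags": {"staging"}},
]

def by_tag(crawlers: list[dict], tag: str) -> list[str]:
    """Return the names of crawlers carrying a given tag."""
    return [c["name"] for c in crawlers if tag in c["tags"]]

print(by_tag(crawlers, "weekly"))  # → ['docs-site-crawler']
```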