Best practices to crawl complete websites

This topic describes what to consider when preparing to index an entire website and capture content updates. We recommend doing this once a week.

Create the crawler

  • Ensure that the crawler type (web crawler or API crawler) is appropriate for all the items you want to make searchable.

  • Configure all the basic settings of the crawler including name, URL, and type.
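As a rough illustration, the basic settings above can be captured in a small configuration object. This is a hypothetical sketch; the field names and the `make_crawler` helper are illustrative, not a specific product's schema.

```python
def make_crawler(name: str, crawler_type: str, start_url: str) -> dict:
    """Build a minimal crawler configuration (illustrative field names)."""
    # Only two crawler types are described in this topic: web and API.
    if crawler_type not in ("web", "api"):
        raise ValueError("crawler type must be 'web' or 'api'")
    return {"name": name, "type": crawler_type, "url": start_url}

config = make_crawler("docs-site", "web", "https://example.com")
```

Validating the type up front catches a mismatch between the crawler type and the items it is meant to index before any crawl runs.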

Configure settings

  • Include all the URLs to be crawled, along with their authentication information, so that every item can be crawled and indexed.

  • Confirm that the crawler type can index all item types at the URLs or endpoints.

  • Support the crawler with extractors for PDFs, images, or localized content, as applicable.
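For URLs behind authentication, the crawler typically needs credentials supplied with each request. As a minimal sketch (assuming HTTP Basic authentication; other schemes such as tokens or cookies work analogously), the required header can be built like this:

```python
import base64

def basic_auth_header(username: str, password: str) -> dict:
    """Return an HTTP Basic Authorization header for a protected URL."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}

# Hypothetical credentials for a protected section of the site.
headers = basic_auth_header("crawler", "secret")
```

Whatever scheme the site uses, confirm that the credentials stored with the crawler actually grant access to every protected URL, or those items will be silently skipped.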

Assign document extractors

  • Confirm that the document extractors can extract values from crawled items, for example, using XPath expressions for HTML or key paths for JSON.

  • Verify that the document extractors extract the desired values accurately.
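One way to verify extractors before a full crawl is to run them against sample items. The sketch below (hypothetical helper names, using Python's standard library, which supports a limited XPath subset) extracts a title from an HTML-like fragment and from a JSON payload:

```python
import json
import xml.etree.ElementTree as ET

def extract_title_xpath(xhtml_fragment: str) -> str:
    """Extract a title with a simple XPath-style rule (requires well-formed markup)."""
    root = ET.fromstring(xhtml_fragment)
    node = root.find(".//title")
    return node.text if node is not None else ""

def extract_title_json(payload: str) -> str:
    """Extract a title from a JSON item by key."""
    return json.loads(payload).get("title", "")

sample_html = "<html><head><title>Crawl basics</title></head><body/></html>"
sample_json = '{"title": "Crawl basics", "body": "..."}'
```

Running both extractors against known samples and comparing the results to the expected values is a quick sanity check that the extraction rules match the page and API structures.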

Configure triggers

  • Include a sitemap and set up a trigger for the crawl.

  • Schedule the crawler to run once a week.

  • Set the trigger for a low-traffic time to minimize impact on site performance.
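A sitemap lists the URLs the trigger should cover, so it is worth checking that the sitemap is complete before scheduling. As a minimal sketch, the standard library can list the URLs a sitemap declares (the `sitemap_urls` helper is illustrative):

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace, per the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text: str) -> list:
    """Return the page URLs declared in a sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall(".//sm:loc", SITEMAP_NS)]

sample_sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs</loc></url>
</urlset>"""
```

If the product accepts cron syntax for triggers, a weekly run at a low-traffic hour might be expressed as, for example, `0 3 * * 0` (Sundays at 03:00); check the product's scheduling format for the exact syntax.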

Add tags

  • Add tags to your crawler to support your categorization and prioritization needs.
