Best practices for crawling complete websites
This topic describes what to consider when preparing to index an entire website and capture content updates. We recommend running a complete crawl once a week.
Create the crawler
- Ensure that the crawler type is appropriate for all searchable items: web crawler or API crawler.
- Configure all the basic settings of the crawler, including name, URL, and type.
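As a minimal sketch of the settings above, the basic definition can be checked before the crawler is created. The field names and validation rules here are illustrative assumptions, not taken from any specific product API:

```python
# Hypothetical crawler definition; field names are illustrative only.
crawler = {
    "name": "docs-site-crawler",
    "type": "web",  # "web" for HTML pages, "api" for JSON endpoints
    "url": "https://docs.example.com",
}

def validate_crawler(cfg: dict) -> list[str]:
    """Return a list of problems with the basic crawler settings."""
    problems = []
    if not cfg.get("name"):
        problems.append("missing name")
    if cfg.get("type") not in ("web", "api"):
        problems.append("type must be 'web' or 'api'")
    if not cfg.get("url", "").startswith(("http://", "https://")):
        problems.append("url must be an absolute http(s) URL")
    return problems

print(validate_crawler(crawler))  # → []
```

A check like this catches an empty name or a relative URL before the first scheduled run fails.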
Configure settings
- Include all the URLs to be crawled and their authentication information so that all items are crawled and indexed.
- Confirm that the crawler type can index all item types at the URLs or endpoints.
- Support the crawler with extractors for PDFs, images, or localized content, as applicable.
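The URL-and-authentication pairing above can be sketched as a small audit: every start URL under a protected section of the site should have credentials configured, or it will be silently skipped. The setting names, URLs, and `secret_ref` field are invented for illustration:

```python
# Illustrative settings sketch: each start URL paired with the
# authentication it needs, so no section of the site is skipped.
settings = {
    "start_urls": [
        {"url": "https://docs.example.com/public", "auth": None},
        {"url": "https://docs.example.com/internal",
         "auth": {"type": "basic", "secret_ref": "internal-creds"}},
        {"url": "https://docs.example.com/partners", "auth": None},
    ],
}

def urls_missing_auth(settings: dict, protected_prefixes: list[str]) -> list[str]:
    """Flag start URLs under a protected prefix with no auth configured."""
    return [
        entry["url"]
        for entry in settings["start_urls"]
        if entry["auth"] is None
        and any(entry["url"].startswith(p) for p in protected_prefixes)
    ]

print(urls_missing_auth(settings, ["https://docs.example.com/internal",
                                   "https://docs.example.com/partners"]))
# → ['https://docs.example.com/partners']
```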
Assign document extractors
- Confirm that the document extractors can extract values from crawled items, for example, using XPath or JSON expressions.
- Validate that the document extractors accurately extract the desired values.
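The two extractor styles mentioned above can be validated against a sample item before a full crawl. This sketch uses only the Python standard library (which supports a limited XPath subset); the sample markup and field names are invented:

```python
import json
import xml.etree.ElementTree as ET

# An XPath-style extractor for HTML/XML content and a key-path
# extractor for a JSON API response, checked against sample items.
html = "<html><head><title>Install guide</title></head><body/></html>"
title = ET.fromstring(html).findtext(".//title")  # limited XPath subset

api_item = json.loads('{"document": {"name": "Install guide", "lang": "en"}}')
name = api_item["document"]["name"]

# Both extractors should yield the same desired value for this item.
assert title == name == "Install guide"
```

Running extractors against a handful of representative pages and API responses like this is a quick way to verify the expressions before scheduling the crawl.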
Configure triggers
- Include a sitemap and set up a trigger for the crawl.
- Schedule the crawler to run once a week.
- Set the trigger for a low-traffic time to minimize impact on site performance.
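The weekly, low-traffic trigger above amounts to computing the next occurrence of a quiet weekly slot. Sunday 02:00 is an assumed quiet window here; pick the slot that matches your own traffic data:

```python
from datetime import datetime, timedelta

def next_weekly_run(now: datetime, weekday: int = 6, hour: int = 2) -> datetime:
    """Next run at the given weekday/hour (Monday=0 ... Sunday=6)."""
    run = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    days_ahead = (weekday - now.weekday()) % 7
    run += timedelta(days=days_ahead)
    if run <= now:  # this week's slot already passed
        run += timedelta(days=7)
    return run

# From a Wednesday noon, the next Sunday 02:00 slot:
print(next_weekly_run(datetime(2024, 5, 1, 12, 0)))  # → 2024-05-05 02:00:00
```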
Add tags
- Add tags to your crawler to support categorization and prioritization.
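Tags can then be used to group crawlers and decide which to run first. A minimal sketch, with invented tag and crawler names:

```python
# Illustrative tagging sketch: tags group crawlers for filtering
# and prioritized runs. All names here are example assumptions.
crawlers = [
    {"name": "docs-site-crawler", "tags": {"production", "weekly"}},
    {"name": "blog-crawler", "tags": {"staging"}},
]

def by_tag(crawlers: list[dict], tag: str) -> list[str]:
    """Return the names of crawlers carrying a given tag."""
    return [c["name"] for c in crawlers if tag in c["tags"]]

print(by_tag(crawlers, "weekly"))  # → ['docs-site-crawler']
```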