Best practices to crawl complete websites
This topic describes what to consider when preparing to index an entire website and capture content updates. We recommend running a full crawl once a week.
Create the crawler
- Ensure that the crawler type (web crawler or API crawler) is appropriate for all searchable items.
- Configure the basic settings of the crawler, including its name, URL, and type.
Configure settings
- Include all the URLs to be crawled and their authentication information so that all items are crawled and indexed.
- Confirm that the crawler type can index all item types at the URLs or endpoints.
- Support the crawler with extractors for PDFs, images, or localized content, as applicable.
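The settings above might be captured in a configuration object like the following sketch. Every field name here is hypothetical and for illustration only; consult your product's actual configuration schema:

```python
# Hypothetical crawler configuration; field names are illustrative,
# not a real product schema.
crawler_config = {
    "name": "marketing-site-crawler",
    "type": "web",  # "web" or "api", matching the searchable items
    "sources": [
        {
            "url": "https://example.com/docs",
            "auth": {"kind": "basic", "username": "crawler-bot"},
        },
        {"url": "https://example.com/blog", "auth": None},
    ],
    # Extractors for non-HTML content, as applicable.
    "extractors": ["pdf", "image", "localized-content"],
}

# Sanity check: every source must declare a crawlable HTTPS URL.
for source in crawler_config["sources"]:
    assert source["url"].startswith("https://")
```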
Assign document extractors
- Confirm that the document extractors (for example, XPath or JSON extractors) can extract values from crawled items.
- Validate that the document extractors extract the desired values accurately.
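The two extraction styles named above can be illustrated with a small, generic sketch. The HTML page and JSON payload below are made up, and the extraction runs with Python's standard library rather than the crawler itself; it shows only the kind of value lookup an extractor performs:

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical crawled page, written as well-formed XML so the
# standard-library parser can read it.
html = (
    "<html><head><title>Product A</title></head>"
    "<body><p class='price'>19.99</p></body></html>"
)
root = ET.fromstring(html)

# XPath-style extraction (ElementTree supports a limited XPath subset).
title = root.find(".//title").text
price = root.find(".//p[@class='price']").text

# JSON extraction from a hypothetical API-crawler response.
payload = json.loads('{"item": {"name": "Product A", "tags": ["new"]}}')
name = payload["item"]["name"]

print(title, price, name)
```

Validating an extractor amounts to running lookups like these against representative crawled items and confirming the returned values match expectations.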
Configure triggers
- Include a sitemap and set up a trigger for the crawl.
- Schedule the crawler to run once a week.
- Set the trigger for a low-traffic time to minimize the impact on site performance.

Sitemap triggers use the last modified date (lastmod) field for each URL in your sitemap. This increases indexing speed and makes content updates more efficient.
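The lastmod-based behavior can be sketched as follows: compare each URL's lastmod value against the time of the last crawl and recrawl only the pages that changed. The sitemap snippet and the last-crawl timestamp are hypothetical:

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Hypothetical sitemap using the standard sitemaps.org namespace.
sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc><lastmod>2024-05-01</lastmod></url>
  <url><loc>https://example.com/b</loc><lastmod>2024-06-15</lastmod></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
last_crawl = datetime(2024, 6, 1, tzinfo=timezone.utc)  # assumed timestamp

root = ET.fromstring(sitemap)
to_recrawl = []
for url in root.findall("sm:url", NS):
    loc = url.findtext("sm:loc", namespaces=NS)
    lastmod = url.findtext("sm:lastmod", namespaces=NS)
    modified = datetime.fromisoformat(lastmod).replace(tzinfo=timezone.utc)
    if modified > last_crawl:
        to_recrawl.append(loc)

# Only pages modified since the last crawl are reindexed.
print(to_recrawl)
```

Because unchanged URLs are skipped entirely, the weekly crawl spends its time only on content that actually changed.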
Add tags
- Add tags to your crawler for categorization and prioritization.
Setting defaults and limits
To improve performance, the following defaults and limits are applied during crawler configuration:
| Setting | Default | Limit |
|---|---|---|
| Timeout | 10,000 ms | 180,000 ms (3 minutes) |
| Max depth | 2 | 5 |
| Parallelism | 1 | 10 |
| Delay | 0 ms | 5,000 ms |
| Max wait (Web Crawler Advanced only, when Render JS is enabled) | 2,000 ms | 60,000 ms (1 minute) |
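How these defaults and limits combine can be sketched with a small helper that fills in missing settings and caps each value at its limit. The keys and the clamping behavior mirror the table above, but the helper itself is hypothetical, not part of the product:

```python
# (default, limit) pairs taken from the table above; names are illustrative.
LIMITS = {
    "timeout_ms": (10_000, 180_000),
    "max_depth": (2, 5),
    "parallelism": (1, 10),
    "delay_ms": (0, 5_000),
    "max_wait_ms": (2_000, 60_000),  # Web Crawler Advanced with Render JS
}

def apply_limits(settings: dict) -> dict:
    """Fill in defaults and cap each setting at its documented limit."""
    return {
        key: min(settings.get(key, default), limit)
        for key, (default, limit) in LIMITS.items()
    }

# An over-limit timeout is capped; unspecified settings take their defaults.
print(apply_limits({"timeout_ms": 999_999, "parallelism": 4}))
```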