Web crawler optimizations
Sitecore Search web crawlers support an optimized crawling mode called delta crawling, which keeps search results up to date and improves indexing speed and efficiency.
Generally, a web crawler crawls all the URLs provided in the connector configuration. With delta crawling, the crawler only crawls URLs that have changed since the previous run.
This is achieved using the lastmod field in a sitemap.
Delta crawling is only applicable to web crawlers using the sitemap or sitemap_index trigger with the lastmod field available in the sitemap, and depth set to 0.
Delta crawling provides the following benefits:
- Receive up-to-date search results due to rapid indexing and identification of content changes.
- Users can trigger a full crawl from the Sitecore Search user interface for comprehensive site indexing.
- Even if the lastmod date is absent, periodic full crawls ensure no updates are missed.
If a sitemap contains URLs without a lastmod field, they're crawled regardless of the status of the previous run.
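The selection rule described above can be illustrated with a minimal Python sketch. It is not Sitecore Search's implementation. It assumes that "changed since the previous run" means a lastmod value later than the previous run's timestamp, and the names `urls_to_crawl`, `parse_lastmod`, `SAMPLE_SITEMAP`, and `PREVIOUS_RUN` are hypothetical. URLs without a lastmod field are always selected, as stated above.

```python
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

# Namespace used by the sitemap protocol (sitemaps.org).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_lastmod(text):
    """Parse a lastmod value (date or datetime) into an aware UTC-comparable datetime."""
    value = text.strip()
    if value.endswith("Z"):
        # Older Python versions do not accept a trailing "Z" in fromisoformat.
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: values without a timezone are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def urls_to_crawl(sitemap_xml, previous_run):
    """Return URLs whose lastmod is later than previous_run, plus URLs with no lastmod."""
    root = ET.fromstring(sitemap_xml)
    selected = []
    for entry in root.findall("sm:url", SITEMAP_NS):
        loc = entry.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = entry.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if loc is None:
            continue
        has_lastmod = bool(lastmod and lastmod.strip())
        # URLs without lastmod are crawled regardless of the previous run.
        if not has_lastmod or parse_lastmod(lastmod) > previous_run:
            selected.append(loc.strip())
    return selected


# Hypothetical sample data for illustration only.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/old</loc><lastmod>2024-04-01</lastmod></url>
  <url><loc>https://example.com/new</loc><lastmod>2024-05-10T08:30:00Z</lastmod></url>
  <url><loc>https://example.com/no-lastmod</loc></url>
</urlset>"""
PREVIOUS_RUN = datetime(2024, 5, 1, tzinfo=timezone.utc)

print(urls_to_crawl(SAMPLE_SITEMAP, PREVIOUS_RUN))
```

In this sample, the page modified before the previous run is skipped, while the recently modified page and the page without a lastmod field are both selected.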
Handling deleted URLs during delta crawling
During delta crawling, Search compares sitemap data from the current delta crawl with the previous successful delta crawl to identify URLs that have since been deleted. Documents associated with those deleted URLs are automatically deleted from the index to prevent outdated content from appearing in search results.
Deleted URL detection occurs only during delta crawls, not full crawls. Because sitemap comparison requires a previous delta crawl, deleted URL detection is performed only when a delta crawl follows a successful delta crawl.
URLs exceeding 4096 characters are excluded from deletion processing and remain indexed.
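The deleted URL rules above can be sketched as a set comparison in Python. This is an illustration under stated assumptions, not the actual Search implementation: the function name `find_deleted_urls`, its parameters, and the constant `MAX_URL_LENGTH` are hypothetical, and the sketch assumes both sitemap snapshots are available as plain collections of URL strings.

```python
# Limit stated in the documentation: longer URLs are excluded from deletion processing.
MAX_URL_LENGTH = 4096


def find_deleted_urls(previous_urls, current_urls, previous_was_successful_delta):
    """Return URLs present in the previous delta crawl's sitemap but missing now.

    Detection is skipped unless the previous run was a successful delta crawl.
    URLs longer than MAX_URL_LENGTH are never reported, so they remain indexed.
    """
    if not previous_was_successful_delta:
        return set()
    current = set(current_urls)
    return {
        url
        for url in previous_urls
        if url not in current and len(url) <= MAX_URL_LENGTH
    }


# Hypothetical sample data for illustration only.
long_url = "https://example.com/" + "a" * 5000
previous = {"https://example.com/a", "https://example.com/b", long_url}
current = {"https://example.com/a"}

print(find_deleted_urls(previous, current, previous_was_successful_delta=True))
```

Here only `https://example.com/b` is reported as deleted: the overlong URL is excluded from deletion processing, and nothing is reported when the previous run was not a successful delta crawl.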
When an administrator recrawls a pull source, either manually or through a scheduled crawl, Search updates all index documents with the latest data from the original content. This overwrites any updates made through the Ingestion API that aren't also applied to the original content. To prevent this, we recommend that you always apply changes made through the Ingestion API to the original content as well. This synchronization ensures that your modifications are preserved in future crawls.