Crawler troubleshooting
This topic helps you diagnose and resolve common errors that cause crawlers to fail.
Crawl fails immediately or no URLs are crawled
If a crawl does not start and no URLs are processed, the crawler typically cannot retrieve or process the sitemap within its configured limit. This usually happens when the trigger timeout is too low for the target site. To allow sufficient time for sitemap retrieval and processing, increase the trigger timeout from the default 1,000 ms (1 second) to 10–60 seconds.
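One way to pick a value inside that range is to time a manual fetch of the sitemap and add headroom. The following Python sketch assumes that approach. The function name and the 3x headroom factor are illustrative assumptions, not product settings.

```python
def suggest_trigger_timeout(observed_seconds, headroom=3.0, floor=10.0, ceiling=60.0):
    """Return a trigger timeout in seconds, clamped to the 10-60 second range.

    observed_seconds: how long a manual sitemap fetch took.
    headroom: assumed safety multiplier (hypothetical, not a product setting).
    """
    if observed_seconds < 0:
        raise ValueError("observed_seconds must be non-negative")
    return max(floor, min(ceiling, observed_seconds * headroom))


print(suggest_trigger_timeout(0.8))   # fast sitemap: stays at the 10 s floor
print(suggest_trigger_timeout(7.5))   # slower sitemap: 7.5 * 3 = 22.5 s
print(suggest_trigger_timeout(45.0))  # very slow sitemap: capped at 60 s
```

If the suggested value keeps landing on the 60-second ceiling, the sitemap itself is probably too slow or too large, and raising the timeout further is unlikely to help.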
Crawler times out while fetching pages
If the crawler reports errors such as context deadline exceeded (Client.Timeout exceeded while awaiting headers), the target site did not respond within the allowed time. The web crawler timeout is intentionally generous and should not normally be reached. Most pages respond within a few seconds, and even large resources, such as large PDF downloads, should complete within minutes.
When pages repeatedly approach this timeout, the cause is usually not the crawler but server performance, network restrictions, or bot‑blocking by the site or Content Delivery Network (CDN). Increasing the crawler timeout does not resolve these issues and only leads to inefficient resource usage.
To fix the error:
- Review server performance and response times.
- Check for network or firewall restrictions.
- Investigate bot‑blocking or request‑throttling rules on the host or CDN.
Individual URL validation succeeds, but crawling fails
When validating an individual URL works but a full crawl fails, the error is usually due to request volume. Validating a single URL is a lightweight request, whereas a crawl generates hundreds or thousands of requests over a short period, which is more likely to trigger security controls.
To fix the error:
- Ask the customer to review access logs on the target site.
- Ensure the crawler’s User‑Agent is allowlisted.
- Ensure Sitecore Search crawler IPs are allowlisted, where applicable.
For more information on user agent allowlisting, see ???.
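When reviewing access logs, it helps to summarize what the crawler actually received. The Python sketch below assumes the target site writes logs in the common Combined Log Format. The user agent token `ExampleCrawler` is a placeholder, not the real crawler User-Agent, and the function name is hypothetical.

```python
import re
from collections import Counter

# Combined Log Format: ip - user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def summarize_crawler_requests(lines, agent_token):
    """Count status codes and per-minute request volume for one user agent."""
    statuses = Counter()
    per_minute = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None or agent_token not in match["agent"]:
            continue
        statuses[int(match["status"])] += 1
        # Timestamps look like 10/Oct/2024:13:55:36 +0000; keep up to the minute.
        per_minute[match["time"][:17]] += 1
    return statuses, per_minute


sample = [
    '203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET /a HTTP/1.1" 200 512 "-" "ExampleCrawler/1.0"',
    '203.0.113.5 - - [10/Oct/2024:13:55:37 +0000] "GET /b HTTP/1.1" 429 0 "-" "ExampleCrawler/1.0"',
    '203.0.113.5 - - [10/Oct/2024:13:56:01 +0000] "GET /c HTTP/1.1" 403 0 "-" "ExampleCrawler/1.0"',
    '198.51.100.7 - - [10/Oct/2024:13:56:02 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]
statuses, per_minute = summarize_crawler_requests(sample, "ExampleCrawler")
print(dict(statuses))
print(dict(per_minute))
```

A rise in 403 or 429 responses that starts partway through a crawl, while browser traffic keeps receiving 200, is consistent with volume-based blocking rather than a problem with individual URLs.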
Error messages are vague or misleading
Websites are not required to return detailed error responses when blocking automated traffic. Many websites intentionally delay responses or return obscure errors as part of bot‑mitigation strategies. In these cases, the HTTP status code logged by the crawler is the most reliable diagnostic signal.
To fix the error:
- Use the reported HTTP status code as the primary indicator.
- Use server, CDN, or Web Application Firewall (WAF) logs to investigate the failure where possible.
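As a rough triage aid, status codes can be mapped to the most likely area to investigate. The Python sketch below relies only on standard HTTP status semantics. The function name and the suggested next steps are assumptions drawn from this topic, not output produced by the crawler.

```python
def likely_cause(status):
    """Map an HTTP status code to the most likely area to investigate."""
    if status in (401, 403):
        return "access denied: check User-Agent and IP allowlisting, WAF rules, authentication"
    if status == 429:
        return "rate limited: check throttling rules and crawl frequency"
    if status in (404, 410):
        return "URL not found: remove it from the sitemap"
    if 500 <= status <= 599:
        return "server-side failure: check server performance and logs"
    if 300 <= status <= 399:
        return "redirect: check for login or cookie gates"
    if 200 <= status <= 299:
        return "request succeeded: if nothing is indexed, check extraction"
    return "unclassified: check server, CDN, or WAF logs"


for code in (200, 403, 429, 503):
    print(code, "->", likely_cause(code))
```

Treat the mapping as a starting point only. A site that blocks bots may deliberately return a misleading code, which is why the server, CDN, or WAF logs remain the second source to check.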
Pages are visited but nothing is indexed
If pages are marked as visited but show Indexed Documents = 0, or Visited > 0 but Crawled = 0, the crawler is able to load the pages but cannot extract content. This is often caused by missing or outdated extraction selectors, JavaScript rendering restrictions, or host/CDN anti‑bot rules. This issue is frequently observed on platforms such as Vercel, where headless crawlers may receive incomplete or restricted content.
For example, an extractor may attempt to read content such as:
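The sketch below is a hypothetical illustration written in Python, not Sitecore Search extractor syntax. It assumes the page title lives in the first `h1` element and returns an empty list when that element is missing, as can happen when a client-rendered page is fetched without JavaScript.

```python
from html.parser import HTMLParser


class FirstElementText(HTMLParser):
    """Collect the text of the first element with the given tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # greater than 0 while inside the target element
        self.done = False   # True once the first target element has closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if not self.done and tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0 and not self.done:
            self.parts.append(data)


def extract(html):
    """Return one document per page, or an empty list if the title is missing."""
    parser = FirstElementText("h1")
    parser.feed(html)
    title = "".join(parser.parts).strip()
    if not title:
        return []
    return [{"title": title}]


server_rendered = "<html><body><h1>Pricing <em>plans</em></h1><p>Details</p></body></html>"
client_rendered = '<html><body><div id="root"></div></body></html>'
print(extract(server_rendered))
print(extract(client_rendered))
```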
If the required element is not present at extraction time, the extractor returns no results, and the page is not crawled or indexed.
To fix the error:
- Disable Render JavaScript if the site is static or server‑rendered, so extraction can run against raw HTML.
- Enable JavaScript rendering only for single‑page applications (SPAs) or pages where core content is loaded exclusively client‑side.
- Allowlist all Sitecore Search crawler IPs if the hosting platform or CDN applies anti‑bot protection.
- Confirm that extraction selectors exist in both raw HTML (JavaScript OFF) and the rendered DOM (JavaScript ON), and update or add fallbacks if needed.
- Ask the customer to review host and CDN security policies, such as bot‑blocking rules, JavaScript execution restrictions, WAF filters, rate limits, geofencing, authentication, and cookie gates.
Delta crawling behaves unexpectedly
Optimized (delta) crawling is attempted based on the sitemap configuration. If the number of failed URLs exceeds the allowed threshold, the delta crawl fails and the next run automatically defaults to a full crawl.
If fewer pages are indexed than expected, or a delta crawl defaults to a full crawl:
- Remove invalid or unreachable URLs from the sitemap.
- Exclude URLs that repeatedly fail.
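If the sitemap is maintained as a file, the cleanup can be scripted. The Python sketch below assumes a standard sitemaps.org `urlset` document and a set of failing URLs collected separately, for example from the crawl report. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def remove_failing_urls(sitemap_xml, failing):
    """Return the sitemap XML with every <url> whose <loc> is in `failing` removed."""
    # Keep the default namespace unprefixed when serializing.
    ET.register_namespace("", SITEMAP_NS)
    root = ET.fromstring(sitemap_xml)
    for url in list(root.findall(f"{{{SITEMAP_NS}}}url")):
        loc = url.find(f"{{{SITEMAP_NS}}}loc")
        if loc is not None and loc.text and loc.text.strip() in failing:
            root.remove(url)
    return ET.tostring(root, encoding="unicode")


sitemap = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc>https://example.com/old-page</loc></url>"
    "</urlset>"
)
cleaned = remove_failing_urls(sitemap, {"https://example.com/old-page"})
print(cleaned)
```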
Partial or blocked content is being indexed
To avoid indexing incomplete or blocked pages, configure extractor validation logic to require the presence of a mandatory field. If the field is missing, extraction fails and the document is excluded from indexing. This ensures that only fully rendered and accessible pages are ingested.
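The validation idea can be sketched as follows. This Python example is illustrative only and is not extractor syntax. It assumes `title` is the mandatory field, and the function and field names are hypothetical.

```python
def validate(document, required_fields=("title",)):
    """Return the document if every required field is non-empty, else None."""
    for field in required_fields:
        value = document.get(field)
        if value is None or not str(value).strip():
            return None
    return document


extracted = [
    {"title": "Pricing", "url": "https://example.com/pricing"},
    {"title": "", "url": "https://example.com/blocked"},   # empty title
    {"url": "https://example.com/partial"},                # title missing
]
indexable = [doc for doc in extracted if validate(doc) is not None]
print([doc["url"] for doc in indexable])
```

Choose a mandatory field that appears only on a fully rendered page, so that block pages and partial responses fail validation.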
Recommended baseline configuration
When crawlers run frequently (for example, on an hourly schedule), the repeated requests increase the likelihood of blocking. If content changes infrequently, reducing the crawl schedule to a daily run lowers request volume and improves stability. For such sites, a sitemap index-based crawler with daily scheduling is an appropriate baseline.
To keep crawls stable and efficient:
- Maintain a clean, valid sitemap.
- Set timeouts that reflect normal site response times.
- Ensure consistent crawler access by maintaining stable allowlisting rules, avoiding temporary blocks, and ensuring that CDN, WAF, and firewall configurations do not change between crawls.
Troubleshooting checklist
When troubleshooting failed crawls:
- Increase trigger timeout to 10–60 seconds.
- Confirm User‑Agent and IP allowlisting.
- Review CDN or WAF logs for blocking or rate limiting.
- Disable Render JavaScript unless required.
- Validate extraction selectors against raw and rendered HTML.
- Remove failing URLs from the sitemap.
- Re‑run the crawl and confirm that both Crawled and Indexed are greater than 0.