Configure indexing

Crawler specifications

The following table shows which features each crawler type supports:

Feature	Details	Web crawler	Advanced web crawler	API crawler
Multiple entities	Extract attribute values from more than one entity	No	Yes	Yes
Content that can be crawled and parsed	HTML	Yes	Yes	Yes
	Microsoft Office formats	Yes	Yes	No
	PDF	Yes	Yes	No
	JSON	No	No	Yes
General Settings	Specify allowed domains	No	Yes	No
	Define a pattern to exclude some URLs	Yes	Yes	Yes
	Specify maximum crawler depth	Yes	Yes	Yes
	Specify maximum URLs that can be crawled	Yes	Yes	Yes
	Specify the number of workers to work in parallel	No	Yes	Yes
	Specify crawler timeouts	No	Yes	No
	Add headers to the crawler if original content expects a header	Yes	Yes	Yes
	Render JavaScript and crawl it in addition to the page source	No	Yes	n/a
Starting the crawl (Trigger)	Use a request URL	Yes	Yes	Yes
	Use a sitemap	Yes	Yes	n/a
	Use a sitemap index	Yes	Yes	n/a
	Use a JavaScript function	No	Yes	Yes
	Use an RSS feed	No	Yes	n/a
Extracting attributes (Document Extractor)	Use an XPath expression	Yes	Yes	n/a
	Use a CSS expression	No	Yes	n/a
	Use a JavaScript function	No	Yes	Yes
	Use JSONPath	No	No	Yes
	Match a specific URL pattern before extracting an attribute	No	Yes	Yes
	Create multiple rules to extract an attribute and then prioritize these rules	No	Yes	Yes
	Specify entity-based rules to extract attributes	No	Yes	Yes
Schedule scans to keep index documents up to date with your original content		Yes	Yes	Yes
Handle original content with multiple locales and languages		No	Yes	Yes
Handle original content that requires authentication		No	Yes	Yes
Add additional starting points not covered by the trigger (Request Extractor)		No	Yes	Yes
Use the Ingestion API to make incremental updates		No	Yes	Yes
Use entity-based tags		No	Yes	Yes
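
One way to use the table is to narrow down a crawler type from the features you need. The following is a minimal sketch of that lookup, not part of any product API. It transcribes a subset of the rows above, the short feature keys (such as "css_extractor") are hypothetical names chosen for illustration, and it assumes that "n/a" can be treated the same as "No".

```python
# Crawler types, in the same order as the table columns.
WEB, ADVANCED, API = "Web crawler", "Advanced web crawler", "API crawler"

# Subset of the support table. Keys are hypothetical short names,
# not product setting identifiers. "n/a" is treated as unsupported.
SUPPORT = {
    "multiple_entities": {ADVANCED, API},
    "html": {WEB, ADVANCED, API},
    "office_formats": {WEB, ADVANCED},
    "pdf": {WEB, ADVANCED},
    "json": {API},
    "allowed_domains": {ADVANCED},
    "parallel_workers": {ADVANCED, API},
    "crawler_timeouts": {ADVANCED},
    "render_javascript": {ADVANCED},
    "sitemap_trigger": {WEB, ADVANCED},
    "rss_trigger": {ADVANCED},
    "xpath_extractor": {WEB, ADVANCED},
    "css_extractor": {ADVANCED},
    "jsonpath_extractor": {API},
    "js_extractor": {ADVANCED, API},
    "multiple_locales": {ADVANCED, API},
    "authentication": {ADVANCED, API},
    "ingestion_api": {ADVANCED, API},
}


def crawlers_supporting(*features):
    """Return the crawler types (in table column order) that support every feature."""
    unknown = [f for f in features if f not in SUPPORT]
    if unknown:
        raise KeyError(f"unknown feature(s): {unknown}")
    candidates = {WEB, ADVANCED, API}
    for feature in features:
        # Keep only the crawler types that also support this feature.
        candidates &= SUPPORT[feature]
    return [c for c in (WEB, ADVANCED, API) if c in candidates]


if __name__ == "__main__":
    print(crawlers_supporting("pdf", "xpath_extractor"))
    print(crawlers_supporting("pdf", "authentication"))
    print(crawlers_supporting("pdf", "json"))
```

In this sketch, an empty result (as for PDF and JSON together) means no single crawler type in the table covers every requested feature, so the content would need more than one source.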
