Crawler specifications
The following table shows which features each crawler type supports:
| Category | Feature | Web crawler | Advanced web crawler | API crawler |
|---|---|---|---|---|
| Multiple entities | Extract attribute values from more than one entity | No | Yes | Yes |
| Content that can be crawled and parsed | HTML | Yes | Yes | Yes |
| Content that can be crawled and parsed | Microsoft Office formats | Yes | Yes | No |
| Content that can be crawled and parsed | | Yes | Yes | No |
| Content that can be crawled and parsed | JSON | No | No | Yes |
| General Settings | Specify allowed domains | No | Yes | No |
| General Settings | Define a pattern to exclude some URLs | Yes | Yes | Yes |
| General Settings | Specify maximum crawler depth | Yes | Yes | Yes |
| General Settings | Specify maximum URLs that can be crawled | Yes | Yes | Yes |
| General Settings | Specify the number of workers to work in parallel | No | Yes | Yes |
| General Settings | Specify crawler timeouts | No | Yes | No |
| General Settings | Add headers to the crawler if original content expects a header | Yes | Yes | Yes |
| General Settings | Render JavaScript and crawl it in addition to the page source | No | Yes | n/a |
| Starting the crawl (Trigger) | Use a request URL | Yes | Yes | Yes |
| Starting the crawl (Trigger) | Use a sitemap | Yes | Yes | n/a |
| Starting the crawl (Trigger) | Use a sitemap index | Yes | Yes | n/a |
| Starting the crawl (Trigger) | Use a JavaScript function | No | Yes | Yes |
| Starting the crawl (Trigger) | Use an RSS feed | No | Yes | n/a |
| Extracting attributes (Document Extractor) | Use an XPath expression | Yes | Yes | n/a |
| Extracting attributes (Document Extractor) | Use a CSS expression | No | Yes | n/a |
| Extracting attributes (Document Extractor) | Use a JavaScript function | No | Yes | Yes |
| Extracting attributes (Document Extractor) | Use JSONPath | No | No | Yes |
| Extracting attributes (Document Extractor) | Match a specific URL pattern before extracting an attribute | No | Yes | Yes |
| Extracting attributes (Document Extractor) | Create multiple rules to extract an attribute and then prioritize these rules | No | Yes | Yes |
| Extracting attributes (Document Extractor) | Specify entity-based rules to extract attributes | No | Yes | Yes |
| | Schedule scans to keep index documents up to date with your original content | Yes | Yes | Yes |
| | Handle original content with multiple locales and languages | No | Yes | Yes |
| | Handle original content that requires authentication | No | Yes | Yes |
| | Add additional starting points not covered by the trigger (Request Extractor) | No | Yes | Yes |
| | Use the Ingestion API to make incremental updates | No | Yes | Yes |
| | Use entity-based tags | No | Yes | Yes |
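
When choosing a crawler type, it can help to treat the table above as a lookup: list the features you need and see which crawler types support all of them. The following is a minimal sketch of that idea, not part of the product. The dictionary keys, the `CRAWLERS` tuple, and the `crawlers_supporting` helper are hypothetical names chosen for illustration, and only a few rows of the table are encoded.

```python
# Hypothetical encoding of a few rows from the table above.
# True = supported ("Yes"), False = not supported ("No"), None = not applicable ("n/a").
# Column order: web crawler, advanced web crawler, API crawler.
CRAWLERS = ("web", "advanced_web", "api")

SUPPORT = {
    "multiple_entities": (False, True, True),
    "html": (True, True, True),
    "json": (False, False, True),
    "render_javascript": (False, True, None),
    "sitemap": (True, True, None),
    "xpath": (True, True, None),
    "css": (False, True, None),
    "jsonpath": (False, False, True),
    "authentication": (False, True, True),
}


def crawlers_supporting(*features):
    """Return the crawler types that support every requested feature."""
    result = []
    for index, crawler in enumerate(CRAWLERS):
        # "n/a" (None) is treated the same as "No": the feature is not usable.
        if all(SUPPORT[feature][index] is True for feature in features):
            result.append(crawler)
    return result


if __name__ == "__main__":
    print(crawlers_supporting("css", "render_javascript"))
    print(crawlers_supporting("html", "authentication"))
```

An empty result, such as the one for `crawlers_supporting("xpath", "jsonpath")`, means that no single crawler type covers the requested combination.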