Crawler specifications
The following table shows which features each crawler type supports:
| Category | Feature | Web crawler | Advanced web crawler | API crawler |
|---|---|---|---|---|
| Multiple entities | Extract attribute values from more than one entity | No | Yes | Yes |
| Content that can be crawled and parsed | HTML | Yes | Yes | Yes |
| Content that can be crawled and parsed | Microsoft Office formats | Yes | Yes | No |
| Content that can be crawled and parsed | | Yes | Yes | No |
| Content that can be crawled and parsed | JSON | No | No | Yes |
| General Settings | Specify allowed domains | No | Yes | No |
| General Settings | Define a pattern to exclude some URLs | Yes | Yes | Yes |
| General Settings | Specify maximum crawler depth | Yes | Yes | Yes |
| General Settings | Specify maximum URLs that can be crawled | Yes | Yes | Yes |
| General Settings | Specify the number of workers to work in parallel | No | Yes | Yes |
| General Settings | Specify crawler timeouts | No | Yes | No |
| General Settings | Add headers to the crawler if original content expects a header | Yes | Yes | Yes |
| General Settings | Render JavaScript and crawl it in addition to the page source | No | Yes | n/a |
| Starting the crawl (Trigger) | Use a request URL | Yes | Yes | Yes |
| Starting the crawl (Trigger) | Use a sitemap | Yes | Yes | n/a |
| Starting the crawl (Trigger) | Use a sitemap index | Yes | Yes | n/a |
| Starting the crawl (Trigger) | Use a JavaScript function | No | Yes | Yes |
| Starting the crawl (Trigger) | Use an RSS feed | No | Yes | n/a |
| Extracting attributes (Document Extractor) | Use an XPath expression | Yes | Yes | n/a |
| Extracting attributes (Document Extractor) | Use a CSS expression | No | Yes | n/a |
| Extracting attributes (Document Extractor) | Use a JavaScript function | No | Yes | Yes |
| Extracting attributes (Document Extractor) | Use JSONPath | No | No | Yes |
| Extracting attributes (Document Extractor) | Match a specific URL pattern before extracting an attribute | No | Yes | Yes |
| Extracting attributes (Document Extractor) | Create multiple rules to extract an attribute and then prioritize these rules | No | Yes | Yes |
| Extracting attributes (Document Extractor) | Specify entity-based rules to extract attributes | No | Yes | Yes |
| | Schedule scans to keep index documents up to date with your original content | Yes | Yes | Yes |
| | Handle original content with multiple locales and languages | No | Yes | Yes |
| | Handle original content that requires authentication | No | Yes | Yes |
| | Add additional starting points not covered by the trigger (Request Extractor) | No | Yes | Yes |
| | Use the Ingestion API to make incremental updates | No | Yes | Yes |
| | Use entity-based tags | No | Yes | Yes |
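
One way to use the table is as a lookup when you choose a crawler type for a set of requirements. The following Python sketch is illustrative only and is not part of any product API: the `SUPPORT` dictionary and the `crawlers_supporting` function are hypothetical names, the dictionary holds only a handful of rows copied from the table, and it assumes that `n/a` can be treated the same as `No`.

```python
# Column order matches the table: Web crawler, Advanced web crawler, API crawler.
CRAWLERS = ("Web crawler", "Advanced web crawler", "API crawler")

# Illustrative subset of rows from the table.
# True means "Yes"; False means "No" or "n/a" (an assumption made for this sketch).
SUPPORT = {
    "Use a sitemap": (True, True, False),
    "Use an RSS feed": (False, True, False),
    "Use an XPath expression": (True, True, False),
    "Use a CSS expression": (False, True, False),
    "Use JSONPath": (False, False, True),
    "Specify crawler timeouts": (False, True, False),
}


def crawlers_supporting(required_features):
    """Return the crawler types that support every feature in required_features.

    Raises KeyError if a feature is not present in the SUPPORT subset.
    """
    return [
        crawler
        for index, crawler in enumerate(CRAWLERS)
        if all(SUPPORT[feature][index] for feature in required_features)
    ]


if __name__ == "__main__":
    print(crawlers_supporting(["Use a sitemap", "Use a CSS expression"]))
    print(crawlers_supporting(["Use JSONPath"]))
```

An empty result means that no single crawler type in this subset covers all the requested features, for example when a sitemap trigger and JSONPath extraction are requested together.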