Crawler specifications

The following table shows which features each crawler type supports:

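| Category | Feature | Web crawler | Advanced web crawler | API crawler |
|---|---|---|---|---|
| Multiple entities | Extract attribute values from more than one entity | No | Yes | Yes |
| Content that can be crawled and parsed | HTML | Yes | Yes | Yes |
| Content that can be crawled and parsed | Microsoft Office formats | Yes | Yes | No |
| Content that can be crawled and parsed | PDF | Yes | Yes | No |
| Content that can be crawled and parsed | JSON | No | No | Yes |
| General Settings | Specify allowed domains | No | Yes | No |
| General Settings | Define a pattern to exclude some URLs | Yes | Yes | Yes |
| General Settings | Specify maximum crawler depth | Yes | Yes | Yes |
| General Settings | Specify maximum URLs that can be crawled | Yes | Yes | Yes |
| General Settings | Specify the number of workers to work in parallel | No | Yes | Yes |
| General Settings | Specify crawler timeouts | No | Yes | No |
| General Settings | Add headers to the crawler if original content expects a header | Yes | Yes | Yes |
| General Settings | Render JavaScript and crawl it in addition to the page source | No | Yes | n/a |
| Starting the crawl (Trigger) | Use a request URL | Yes | Yes | Yes |
| Starting the crawl (Trigger) | Use a sitemap | Yes | Yes | n/a |
| Starting the crawl (Trigger) | Use a sitemap index | Yes | Yes | n/a |
| Starting the crawl (Trigger) | Use a JavaScript function | No | Yes | Yes |
| Starting the crawl (Trigger) | Use an RSS feed | No | Yes | n/a |
| Extracting attributes (Document Extractor) | Use an XPath expression | Yes | Yes | n/a |
| Extracting attributes (Document Extractor) | Use a CSS expression | No | Yes | n/a |
| Extracting attributes (Document Extractor) | Use a JavaScript function | No | Yes | Yes |
| Extracting attributes (Document Extractor) | Use JSONPath | No | No | Yes |
| Extracting attributes (Document Extractor) | Match a specific URL pattern before extracting an attribute | No | Yes | Yes |
| Extracting attributes (Document Extractor) | Create multiple rules to extract an attribute and then prioritize these rules | No | Yes | Yes |
| Extracting attributes (Document Extractor) | Specify entity-based rules to extract attributes | No | Yes | Yes |
| | Schedule scans to keep index documents up to date with your original content | Yes | Yes | Yes |
| | Handle original content with multiple locales and languages | No | Yes | Yes |
| | Handle original content that requires authentication | No | Yes | Yes |
| | Add additional starting points not covered by the trigger (Request Extractor) | No | Yes | Yes |
| | Use the Ingestion API to make incremental updates | No | Yes | Yes |
| | Use entity-based tags | No | Yes | Yes |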
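
When choosing a crawler type, it can help to treat the table as a lookup. The following is a minimal sketch, not part of the product: the feature keys (`parse_pdf`, `authentication`, and so on) and the `crawlers_supporting` helper are hypothetical names chosen for illustration. Only the Yes/No/n/a values are copied from the table, and only a subset of rows is included.

```python
# Column order matches the table: Web crawler, Advanced web crawler, API crawler.
CRAWLER_TYPES = ("Web crawler", "Advanced web crawler", "API crawler")

# Hypothetical feature keys; the values are copied from the table rows.
SUPPORT = {
    "multiple_entities": ("No", "Yes", "Yes"),
    "parse_pdf": ("Yes", "Yes", "No"),
    "parse_json": ("No", "No", "Yes"),
    "render_javascript": ("No", "Yes", "n/a"),
    "sitemap_trigger": ("Yes", "Yes", "n/a"),
    "rss_trigger": ("No", "Yes", "n/a"),
    "xpath_extractor": ("Yes", "Yes", "n/a"),
    "jsonpath_extractor": ("No", "No", "Yes"),
    "authentication": ("No", "Yes", "Yes"),
}


def crawlers_supporting(required):
    """Return the crawler types marked "Yes" for every required feature.

    "No" and "n/a" are both treated as unsupported.
    """
    matches = []
    for index, crawler in enumerate(CRAWLER_TYPES):
        if all(SUPPORT[feature][index] == "Yes" for feature in required):
            matches.append(crawler)
    return matches


print(crawlers_supporting(["parse_pdf", "authentication"]))
# ['Advanced web crawler']
print(crawlers_supporting(["parse_json", "jsonpath_extractor"]))
# ['API crawler']
```

Treating "n/a" the same as "No" is a simplification made for this sketch; a feature marked n/a may simply not apply to that crawler type.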
