Deciding which source to use

You can create different types of sources in Sitecore Search. Depending on your business requirements, you can use one, some, or all of them.

To determine which type of source to configure to index your content, consider the following:

  • If you have content for only one locale and language, and all the content is available on an HTML page or in a Microsoft Office format like Word or PowerPoint, create a web crawler. The web crawler usually covers all basic crawling requirements.

  • If you create a web crawler but then reach a point where you need additional settings, convert it to an advanced web crawler. For example, you need an advanced web crawler to handle authentication requirements, use JavaScript functions to extract attributes for each index document, or create index documents for content in multiple languages. For details, see Walkthrough: Configuring an advanced web crawler.

    We recommend starting with a web crawler and then converting it to an advanced web crawler if necessary.

  • If your content can only be accessed through an API endpoint, and the endpoint returns JSON, use an API crawler.

  • If you want to create a new index document and add it to an existing index, or quickly update or delete existing index documents, use the Ingestion API.

    For example, suppose you have an advanced web crawler that frequently crawls a website for blog posts and adds them to an index. If you publish a new blog post that you urgently need to make available to your visitors and cannot wait for the next scheduled scan, use the Ingestion API to add the post (see the request sketch after this list).

  • If you want to create a placeholder index to add index documents that are not covered by any other source, create an API push source and then use the Ingestion API to add the index document. Later, you can use the Ingestion API to update or delete this index document.
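
The following is a minimal sketch in TypeScript of pushing one document through the Ingestion API. The host, endpoint path, domain and source identifiers, entity name, authorization scheme, and document fields are all illustrative assumptions; consult the Ingestion API reference for the exact URL structure and payload schema.

```typescript
// Sketch: push one new document to a source through the Ingestion API.
// All identifiers below (INGESTION_HOST, DOMAIN_ID, SOURCE_ID, the "content"
// entity, and the document fields) are illustrative placeholders -- check the
// Ingestion API reference for the exact URL structure and payload schema.

const INGESTION_HOST = "https://discover.sitecorecloud.io"; // assumed host
const DOMAIN_ID = "my-domain-id"; // placeholder
const SOURCE_ID = "my-source-id"; // placeholder
const API_KEY = process.env.INGESTION_API_KEY ?? "";

async function addBlogPost(): Promise<void> {
  const url =
    `${INGESTION_HOST}/ingestion/v1/domains/${DOMAIN_ID}` +
    `/sources/${SOURCE_ID}/entities/content/documents`;

  const response = await fetch(url, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: API_KEY, // auth scheme is an assumption
    },
    body: JSON.stringify({
      document: {
        id: "blog-new-feature", // placeholder document ID
        fields: {
          title: "Announcing our new feature",
          url: "https://www.example.com/blog/new-feature",
          description: "A post we need indexed before the next scheduled scan.",
        },
      },
    }),
  });

  if (!response.ok) {
    throw new Error(`Ingestion request failed: ${response.status}`);
  }
}

addBlogPost().catch(console.error);
```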

This matrix lists the features that each crawler supports:

| Feature | Web crawler | Advanced web crawler | API crawler |
| --- | --- | --- | --- |
| Supported source content type | | | |
| HTML | Yes | Yes | Yes |
| Microsoft Office formats | Yes | Yes | No |
| PDF | Yes | Yes | No |
| JSON | No | No | Yes |
| General Settings | | | |
| Specify allowed domains | No | Yes | No |
| Define a pattern to exclude some URLs | Yes | Yes | Yes |
| Specify maximum crawler depth | Yes | Yes | Yes |
| Specify maximum URLs that can be crawled | Yes | Yes | Yes |
| Specify the number of workers to work in parallel | No | Yes | Yes |
| Specify crawler timeouts | No | Yes | No |
| Add headers to the crawler if source content expects a header | Yes | Yes | Yes |
| Render JavaScript and crawl it in addition to the page source | No | Yes | n/a |
| Starting the crawl (Trigger) | | | |
| Use a request URL | Yes | Yes | Yes |
| Use a sitemap | Yes | Yes | n/a |
| Use a sitemap index | Yes | Yes | n/a |
| Use a JavaScript function | No | Yes | Yes |
| Use an RSS feed | No | Yes | n/a |
| Extracting attributes (Document Extractor) | | | |
| Use an XPath expression | Yes | Yes | n/a |
| Use a CSS expression | No | Yes | n/a |
| Use a JavaScript function | No | Yes | Yes |
| Use JSONPath | No | No | Yes |
| Match a specific URL pattern before extracting an attribute | No | Yes | Yes |
| Create multiple rules to extract an attribute and then prioritize these rules | No | Yes | Yes |
| Specify entity-based rules to extract attributes | No | Yes | Yes |
| Other features | | | |
| Schedule scans to keep index documents up to date with your source content | Yes | Yes | Yes |
| Handle source content with multiple locales and languages | No | Yes | Yes |
| Handle source content that requires authentication | No | Yes | Yes |
| Add additional starting points not covered by the trigger (Request Extractor) | No | Yes | Yes |
| Use the Ingestion API to make incremental updates | No | Yes | Yes |
| Tag sources with the entity they apply to | No | Yes | Yes |
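
The extraction rows in the matrix refer to expressions you enter in a source's document extractor. As a rough illustration of what those expressions select, the following TypeScript sketch evaluates a CSS expression against an HTML page (using the cheerio npm package) and a JSONPath expression against a JSON response (using jsonpath-plus); XPath expressions play an equivalent role for HTML sources. The packages, sample markup, and field names are illustrative only; in Sitecore Search you supply just the expressions in the document extractor configuration rather than writing code like this yourself.

```typescript
import * as cheerio from "cheerio";
import { JSONPath } from "jsonpath-plus";

// CSS expression (web crawler / advanced web crawler): pick the title and
// meta description out of an HTML page.
const html = `
  <html>
    <head>
      <title>Spring sale</title>
      <meta name="description" content="Save 20% on selected items.">
    </head>
  </html>`;
const $ = cheerio.load(html);
const title = $("title").text(); // "Spring sale"
const description = $('meta[name="description"]').attr("content");

// JSONPath expression (API crawler): pick every item title out of a JSON
// response returned by an API endpoint.
const apiResponse = {
  items: [
    { title: "First article", url: "/articles/1" },
    { title: "Second article", url: "/articles/2" },
  ],
};
const titles = JSONPath({ path: "$.items[*].title", json: apiResponse });

console.log({ title, description, titles });
```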
