Deciding which source to use
You can create different types of sources in Sitecore Search. Depending on your business requirements, you can use one, some, or all of them.
To determine which type of source to configure to index your content, consider the following:
- If you have content for only one locale and language, and all the content is available on an HTML page or in a Microsoft Office format like Word or PowerPoint, create a web crawler. The web crawler usually covers all basic crawling requirements.
- If you create a web crawler but then need additional settings, convert it to an advanced web crawler. For example, if you want to handle authentication requirements, use JavaScript functions to extract attributes for each index document, or create index documents for content in multiple languages, follow Walkthrough: Configuring an advanced web crawler.
  We recommend starting with a web crawler and converting it to an advanced web crawler if necessary.
- If your content can only be accessed through an API endpoint, and the endpoint returns JSON, use an API crawler.
- If you want to create a new index document and add it to an existing index, or quickly update or delete existing index documents, use the Ingestion API.
  For example, suppose you have an advanced web crawler that frequently crawls a website for blogs and adds them to an index. If you get a new blog that you urgently need to make available to your visitors and cannot wait for the next scheduled scan, use the Ingestion API to add it.
- If you want to create a placeholder index for index documents that are not covered by any other source, create an API push source and then use the Ingestion API to add the index documents. Later, you can use the Ingestion API to update or delete these index documents.
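As a sketch of the Ingestion API flow described above, the following Python snippet builds (but does not send) a request that adds a single blog document to a source. The host, URL path, header names, and document fields here are illustrative assumptions, not the documented Ingestion API contract; check the Ingestion API reference for the exact endpoint and payload shape.

```python
import json
import urllib.request

# All endpoint details below are illustrative assumptions -- substitute
# your real host, domain ID, source ID, entity, and API key.
API_HOST = "https://discover.example.com"  # hypothetical host
DOMAIN_ID = "my-domain"
SOURCE_ID = "my-api-push-source"

# The new blog that cannot wait for the next scheduled scan.
document = {
    "id": "blog-2024-001",
    "name": "Announcing our new feature",
    "url": "https://www.example.com/blog/new-feature",
    "type": "blog",
    "locale": "en_us",
}

def build_ingest_request(doc: dict) -> urllib.request.Request:
    """Build a PUT request that would ingest one index document."""
    url = (
        f"{API_HOST}/ingestion/v1/domains/{DOMAIN_ID}"
        f"/sources/{SOURCE_ID}/entities/content/documents/{doc['id']}"
    )
    return urllib.request.Request(
        url,
        data=json.dumps({"document": {"fields": doc}}).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "my-api-key",  # placeholder credential
        },
        method="PUT",
    )

req = build_ingest_request(document)
print(req.method, req.full_url)
```

Because the document ID is part of the URL, sending the same request again with changed fields would update the document, and a DELETE to the same URL would remove it, which is what makes the Ingestion API suitable for quick incremental changes.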
This matrix lists the detailed features that each crawler supports:

| Feature | Web crawler | Advanced web crawler | API crawler |
|---|---|---|---|
| **Supported source content type** | | | |
| HTML | Yes | Yes | Yes |
| Microsoft Office formats | Yes | Yes | No |
| | Yes | Yes | No |
| JSON | No | No | Yes |
| **General settings** | | | |
| Specify allowed domains | No | Yes | No |
| Define a pattern to exclude some URLs | Yes | Yes | Yes |
| Specify maximum crawler depth | Yes | Yes | Yes |
| Specify maximum URLs that can be crawled | Yes | Yes | Yes |
| Specify the number of workers to work in parallel | No | Yes | Yes |
| Specify crawler timeouts | No | Yes | No |
| Add headers to the crawler if source content expects a header | Yes | Yes | Yes |
| Render JavaScript and crawl it in addition to the page source | No | Yes | n/a |
| **Starting the crawl (trigger)** | | | |
| Use a request URL | Yes | Yes | Yes |
| Use a sitemap | Yes | Yes | n/a |
| Use a sitemap index | Yes | Yes | n/a |
| Use a JavaScript function | No | Yes | Yes |
| Use an RSS feed | No | Yes | n/a |
| **Extracting attributes (document extractor)** | | | |
| Use an XPath expression | Yes | Yes | n/a |
| Use a CSS expression | No | Yes | n/a |
| Use a JavaScript function | No | Yes | Yes |
| Use JSONPath | No | No | Yes |
| Match a specific URL pattern before extracting an attribute | No | Yes | Yes |
| Create multiple rules to extract an attribute and then prioritize these rules | No | Yes | Yes |
| Specify entity-based rules to extract attributes | No | Yes | Yes |
| **Other** | | | |
| Schedule scans to keep index documents up to date with your source content | Yes | Yes | Yes |
| Handle source content with multiple locales and languages | No | Yes | Yes |
| Handle source content that requires authentication | No | Yes | Yes |
| Add additional starting points not covered by the trigger (request extractor) | No | Yes | Yes |
| Use the Ingestion API to make incremental updates | No | Yes | Yes |
| Tag sources with the entity they apply to | No | Yes | Yes |
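To illustrate the "Use JSONPath" row above: an API crawler walks the JSON an endpoint returns and pulls attribute values out by path, one extraction rule per attribute. The snippet below is a minimal hand-rolled sketch of that idea using dotted paths (real JSONPath syntax such as `$.items[0].title` is richer); the response shape and rule names are invented for the example.

```python
import json

# Invented example of what an API endpoint might return to an API crawler.
response_body = json.loads("""
{
  "items": [
    {"title": "Getting started", "meta": {"url": "https://example.com/a"}},
    {"title": "Advanced setup",  "meta": {"url": "https://example.com/b"}}
  ]
}
""")

def get_path(obj, path: str):
    """Resolve a dotted path like 'meta.url' against a nested dict."""
    for key in path.split("."):
        obj = obj[key]
    return obj

# Attribute name -> path inside each item, mimicking per-attribute
# extraction rules (a simplified stand-in for real JSONPath expressions).
rules = {"name": "title", "url": "meta.url"}

# One index document per item in the response.
documents = [
    {attr: get_path(item, path) for attr, path in rules.items()}
    for item in response_body["items"]
]
print(documents)
```

Each resulting dictionary corresponds to one index document, which is why JSON-only endpoints pair naturally with the API crawler rather than with either web crawler.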