Sitecore Cortex Content Tagging architecture

Abstract

The architecture of Sitecore Cortex™ Content Tagging.

This topic describes the architecture of the Sitecore Cortex™ Content Tagging feature.

The Sitecore Cortex Content Tagging feature consists of the following:

  • Providers (IContentProvider, IDiscoveryProvider, ITaxonomyProvider, and ITagger) – contain business logic that performs content tagging operations.

  • Configuration services (IItemContentTaggingProviderSetBuilder, IItemContentTaggingConfigurationService) – build the set of providers that performs the content tagging operations, based on the configuration.

  • Pipelines (getTaggingConfiguration, tagContent, normalizeContent) – give you extension points to inject custom logic into the content tagging process.

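The process of content tagging consists of four steps, and there is an abstraction for each step: retrieving the content (IContentProvider), discovering tags for that content (IDiscoveryProvider), storing the tags (ITaxonomyProvider), and applying the tags to the item (ITagger).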

The following diagram shows the dependencies between all provider types:

ContentTagging_ProviderDependencyDiagram.png

You can configure each part of the content tagging process. When a user triggers the tagging process, the getTaggingConfiguration pipeline reads the Sitecore configuration and builds a named set of providers based on the configuration.

  • The IItemContentTaggingConfigurationService service reads the names of providers that are specified in the content tagging configuration and returns the ItemContentTaggingConfiguration object.

  • The IItemContentTaggingProviderSetBuilder service uses the ItemContentTaggingConfiguration object to build a set of providers that will be used for content tagging.

The getTaggingConfiguration pipeline reads the configuration name and then builds a provider set for this configuration.

The tagContent pipeline uses a set of providers created by the getTaggingConfiguration pipeline to provide content tagging. The tagContent pipeline consists of the following pipeline processors:

  • RetrieveContent – uses the configured content provider to get taggable content from the context item.

  • Normalize – takes TaggableContent objects and normalizes them, by triggering the normalizeContent pipeline, before they are passed to the GetTags pipeline processor.

  • GetTags – gets TagData objects for TaggableContent objects. Uses the configured discovery provider for tagging. The output is the list of TagData objects related to the input content.

  • StoreTags – stores the received tags. Uses the configured taxonomy provider. The default implementation creates tag items in the Sitecore tag repository.

  • ApplyTags – marks the context item with tags by adding the IDs of the tag items created by the StoreTags pipeline processor to the Semantics field in the Tagging section of the context item. Uses the configured tagger provider.

The normalizeContent pipeline is a separate pipeline to prepare TaggableContent objects for tagging. It is triggered by the Normalize pipeline processor in the tagContent pipeline.

The code for Sitecore Cortex Content Tagging is broken down into three DLLs.

The Sitecore.ContentTagging.Core DLL contains abstractions, default implementations, and infrastructural code. You can reference this DLL to run parts of Sitecore Cortex Content Tagging on their own. For example, to get tags for some text without storing them, you can call the IDiscoveryProvider CreateDiscoveryProvider(string providerName) method to instantiate a discovery provider that is registered by name in the configuration file. More generally, you can use the IContentTaggingProviderFactory interface to get an instance of any of the four provider types by name.

The Sitecore.ContentTagging DLL integrates Sitecore Cortex Content Tagging with Sitecore. It contains the infrastructure to run content tagging from the Sitecore UI, as well as the extension points (pipelines).

The Sitecore.ContentTagging.OpenCalais DLL implements the discovery provider for Refinitiv Intelligent Tagging Open Calais. This allows Sitecore to use Open Calais for content tagging.

The configuration file contains the <contentTagging> section. This section contains the following:

  • <providers> contains all registered providers grouped into the following sections:

    • <content> aggregates IContentProvider implementations  

    • <discovery> aggregates IDiscoveryProvider implementations

    • <tagger> aggregates ITagger implementations

    • <taxonomy> aggregates ITaxonomyProvider implementations

  • <configurations> defines different configuration sets using providers defined in the <providers> section.

<contentTagging>
    <providers>
        <content>
            <add name="DefaultContentProvider" type="Sitecore.ContentTagging.Core.Providers.DefaultContentProvider, Sitecore.ContentTagging.Core" />
        </content>
        <discovery>
            <add name="DefaultDiscoveryProvider" type="Sitecore.ContentTagging.Core.Providers.DummyDiscoveryProvider, Sitecore.ContentTagging.Core" />
        </discovery>
        <tagger>
            <add name="DefaultTagger" type="Sitecore.ContentTagging.Core.Providers.DefaultTagger, Sitecore.ContentTagging.Core" />
        </tagger>
        <taxonomy>
            <add name="DefaultTaxonomyProvider" type="Sitecore.ContentTagging.Core.Providers.DefaultTaxonomyProvider, Sitecore.ContentTagging.Core" />
        </taxonomy>
    </providers>
    <configurations>
        <config name="Default">
            <content>
                <provider name="DefaultContentProvider"/>
            </content>
            <tagger>
                <provider name="DefaultTagger"/>
            </tagger>
            <taxonomy>
                <provider name="DefaultTaxonomyProvider"/>
            </taxonomy>
            <discovery>
                <provider name="DefaultDiscoveryProvider"/>
            </discovery>
        </config>
    </configurations>
</contentTagging>
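
As a rough sketch of how the <providers> and <configurations> sections fit together, the following fragment registers one additional discovery provider and defines a second configuration set that uses it together with the default providers. The names MyDiscoveryProvider, MyCompany.Tagging, and Custom are hypothetical placeholders and are not part of the product. Only the element structure is taken from the default configuration.

```xml
<contentTagging>
    <providers>
        <discovery>
            <!-- Hypothetical provider: the name, type, and assembly are placeholders for illustration. -->
            <add name="MyDiscoveryProvider" type="MyCompany.Tagging.MyDiscoveryProvider, MyCompany.Tagging" />
        </discovery>
    </providers>
    <configurations>
        <!-- Hypothetical configuration set that replaces only the discovery step. -->
        <config name="Custom">
            <content>
                <provider name="DefaultContentProvider"/>
            </content>
            <tagger>
                <provider name="DefaultTagger"/>
            </tagger>
            <taxonomy>
                <provider name="DefaultTaxonomyProvider"/>
            </taxonomy>
            <discovery>
                <provider name="MyDiscoveryProvider"/>
            </discovery>
        </config>
    </configurations>
</contentTagging>
```

In this sketch, each provider name referenced inside a <config> element is expected to match a name registered under <providers>. Only the discovery step differs from the Default set.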

Video: Sitecore Cortex - Content Tagging Architecture

You can watch this video to see the customization and extension points included in the Sitecore Cortex content tagging feature. The video demonstrates how to configure new providers and configuration sets.