Processing and aggregation

Abstract

Overview of live and historical data aggregation in Sitecore using distributed processing.

The Content Delivery (CD) role collects experience data that is stored in Sitecore Experience Database (xDB), where it is also processed in near real-time for use in Analytics and Reporting and Marketing Automation.

xDB processes data as it is submitted or on demand. Processing can continually aggregate the collected data and make it available for actionable insights or reporting through Experience Analytics or external business intelligence tools.

xConnect Collection exposes a single endpoint where internal and external trusted systems can plug in and react to data being collected or updated about a contact. For example, to ensure that a Sitecore solution is GDPR- or privacy-compliant, systems can execute the right to be forgotten operation and any plugin in xConnect can respond to this and notify the surrounding systems to forget the contact.
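The plug-in idea described above can be pictured as a simple publish and subscribe arrangement. The following is a minimal sketch under stated assumptions, not the xConnect API: `CollectionEndpoint`, `register`, `forget_contact`, and the `contact_forgotten` event name are all hypothetical, and deleting the contact from a dictionary stands in for whatever the real right to be forgotten operation does.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory model; none of these names are Sitecore APIs.
Handler = Callable[[str], None]


class CollectionEndpoint:
    """Toy stand-in for a collection service that notifies plug-ins."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = {}
        self.contacts: Dict[str, dict] = {}

    def register(self, event: str, handler: Handler) -> None:
        """Subscribe a plug-in handler to a named event."""
        self._handlers.setdefault(event, []).append(handler)

    def forget_contact(self, contact_id: str) -> None:
        # Remove the stored data, then let every registered plug-in react.
        self.contacts.pop(contact_id, None)
        for handler in self._handlers.get("contact_forgotten", []):
            handler(contact_id)


notified: List[str] = []
endpoint = CollectionEndpoint()
endpoint.contacts["c-1"] = {"email": "a@example.com"}
# Two surrounding systems (assumed names) subscribe to the same event.
endpoint.register("contact_forgotten", lambda cid: notified.append(f"crm:{cid}"))
endpoint.register("contact_forgotten", lambda cid: notified.append(f"mail:{cid}"))
endpoint.forget_contact("c-1")
```

After the call, both subscribers have been told to forget contact `c-1`, in the order in which they registered.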

The normal process of aggregation.

The concept of plug-ins is also used for aggregating the data as it is being collected. This is called live aggregation.

For example, when a Content Delivery session ends, it submits an interaction to the xConnect Collection role and the live aggregation plug-in in xConnect reacts. The plug-in stores a record in the xDB Processing Pools database and relays information to the xDB Processing application about how to handle the new interaction.

The process of live aggregation.

xDB Processing continuously polls the xDB Processing Pools database, picks up the recently added aggregation task, and starts the aggregation process.

During the processing, xDB Processing will pull the new interaction from the xConnect Collection role as well as additional data needed in the aggregation from other sources, for example, the Reference Data service.

Finally, when the aggregation is done, the resulting data is stored in the xDB Reporting database.
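The live aggregation steps above can be sketched as a small in-memory pipeline. This is an illustration only: the dictionaries and the queue stand in for the xDB Collection, xDB Processing Pools, Reference Data, and xDB Reporting stores, and the function names and the "visits per channel" metric are assumptions chosen for the example.

```python
from collections import deque
from typing import Deque, Dict

# Hypothetical stand-ins for the stores named in the text.
collection: Dict[str, dict] = {}          # interactions by id
processing_pool: Deque[str] = deque()     # pending aggregation tasks
reference_data: Dict[str, str] = {"ch-1": "Organic search"}
reporting: Dict[str, int] = {}            # visits per channel name


def submit_interaction(interaction_id: str, channel_id: str) -> None:
    """Session end: store the interaction and enqueue a pool record."""
    collection[interaction_id] = {"channel_id": channel_id}
    processing_pool.append(interaction_id)


def process_pool_once() -> int:
    """One polling pass: drain the pool and aggregate each interaction."""
    processed = 0
    while processing_pool:
        interaction_id = processing_pool.popleft()
        # Pull the interaction, then enrich it from reference data.
        interaction = collection[interaction_id]
        channel = reference_data.get(interaction["channel_id"], "Unknown")
        # Store the aggregated result in the reporting store.
        reporting[channel] = reporting.get(channel, 0) + 1
        processed += 1
    return processed


submit_interaction("i-1", "ch-1")
submit_interaction("i-2", "ch-1")
submit_interaction("i-3", "ch-9")   # channel id missing from reference data
count = process_pool_once()
```

A real deployment polls on a schedule and runs across processes; the single `process_pool_once` call here only shows the order of the steps.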

After you deploy a new Sitecore version, or if you extend your solution with new reporting dimensions or datasets, you must reprocess all of the interactions in the xDB. This process is called historical aggregation.

To enable historical aggregation, you must set up an additional secondary xDB Reporting database.

The role of the secondary reporting database.

When you attach a secondary xDB Reporting database to the xDB Processing role, both the primary and secondary xDB Reporting databases store all live aggregation data.

Note

You must not add a secondary xDB Reporting database unless you plan to run historical re-aggregation. A secondary database requires the system to write to both the primary and secondary xDB Reporting databases, which increases the overall load on the system.

An administrator can begin the historical re-aggregation process through the Sitecore administrative interface. The Content Management (CM) role then triggers the processing operation on the xDB Processing role.

The operation first erases all the data in the secondary xDB Reporting database and creates a historical re-aggregation task in the xDB Processing Tasks database. The xDB Processing role then performs data extraction to get an enumerator for the entire set of interactions in the xDB Collection database.

The process of historical aggregation.

If the aggregation of a single interaction fails, it is added to the xDB Processing Pools database and the aggregation is retried at the end of the historical re-aggregation process.
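The deferred retry described above can be sketched as a two-pass loop. This is a simplified model, not Sitecore code: the `reaggregate` function and the list used as a retry pool are hypothetical, and performing exactly one retry pass is an assumption made for the example.

```python
from typing import Callable, Dict, Iterable, List


def reaggregate(
    interaction_ids: Iterable[str],
    aggregate: Callable[[str], None],
) -> List[str]:
    """Run aggregate over every id and retry failures once at the end.

    Returns the ids that still failed after the retry pass.
    """
    retry_pool: List[str] = []
    for interaction_id in interaction_ids:
        try:
            aggregate(interaction_id)
        except Exception:
            retry_pool.append(interaction_id)   # defer and keep going

    still_failing: List[str] = []
    for interaction_id in retry_pool:
        try:
            aggregate(interaction_id)
        except Exception:
            still_failing.append(interaction_id)
    return still_failing


attempts: Dict[str, int] = {}
done: List[str] = []


def flaky_aggregate(interaction_id: str) -> None:
    """Test double: i-2 fails once, i-4 always fails."""
    attempts[interaction_id] = attempts.get(interaction_id, 0) + 1
    if interaction_id == "i-2" and attempts[interaction_id] == 1:
        raise RuntimeError("transient failure")
    if interaction_id == "i-4":
        raise ValueError("malformed interaction")
    done.append(interaction_id)


failed = reaggregate(["i-1", "i-2", "i-3", "i-4"], flaky_aggregate)
```

Deferring the failed items keeps one bad interaction from blocking the rest of the run; `i-2` succeeds on the retry pass, while `i-4` is reported as still failing.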

If any new interactions come in through the xConnect Collection role, the live aggregation process writes them to both the primary and secondary databases. This ensures that the new interactions that are submitted to xConnect during the historical aggregation process are not lost.
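A minimal sketch of this dual-write behavior follows. The `ReportingTargets` class and its methods are hypothetical, plain dictionaries stand in for the two xDB Reporting databases, and the assumption that the rebuild writes only to the secondary database is inferred from the description rather than taken from an API.

```python
from typing import Dict, Optional

Report = Dict[str, int]


class ReportingTargets:
    """Hypothetical holder for a primary and an optional secondary store."""

    def __init__(self, primary: Report, secondary: Optional[Report] = None) -> None:
        self.primary = primary
        self.secondary = secondary

    def write_live(self, key: str, value: int) -> None:
        # Live aggregation goes to every attached database.
        for db in (self.primary, self.secondary):
            if db is not None:
                db[key] = db.get(key, 0) + value

    def write_historical(self, key: str, value: int) -> None:
        # Assumption: the rebuild only touches the secondary database.
        if self.secondary is None:
            raise RuntimeError("no secondary reporting database attached")
        self.secondary[key] = self.secondary.get(key, 0) + value


# The secondary store starts empty, as it would after being erased.
targets = ReportingTargets(primary={"visits": 10}, secondary={})
targets.write_historical("visits", 10)   # rebuilt from stored interactions
targets.write_live("visits", 1)          # arrives during the rebuild
```

Both stores end with the same total, so nothing is lost when the databases are later switched. The extra write on every live interaction is also the additional load that the earlier note warns about.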

Note

Processing all historic interactions in the Sitecore Experience Database can be very heavy on resources. To avoid system strain, you can scale the xDB Processing role horizontally to split the aggregation task across multiple servers and threads.

The initiating xDB Processing role splits the dataset in the xDB Processing Tasks database into several parts and assigns each part of the dataset (also called a cursor) to a processing worker or thread on each xDB Processing role.
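One way to picture the split is as contiguous slices of the interaction set, one slice per worker thread. The sketch below is an assumption about the general technique, not the partitioning scheme Sitecore uses; `split_into_cursors` and the `(role index, thread index, ids)` tuple layout are hypothetical.

```python
from typing import List, Sequence, Tuple

Cursor = Tuple[int, int, List[str]]  # (role index, thread index, interaction ids)


def split_into_cursors(
    interaction_ids: Sequence[str], roles: int, threads_per_role: int
) -> List[Cursor]:
    """Split ids into near-equal contiguous parts, one per worker thread."""
    if roles < 1 or threads_per_role < 1:
        raise ValueError("roles and threads_per_role must be at least 1")
    workers = roles * threads_per_role
    size, extra = divmod(len(interaction_ids), workers)
    cursors: List[Cursor] = []
    start = 0
    for worker in range(workers):
        # The first `extra` workers take one additional item each.
        end = start + size + (1 if worker < extra else 0)
        cursors.append(
            (
                worker // threads_per_role,
                worker % threads_per_role,
                list(interaction_ids[start:end]),
            )
        )
        start = end
    return cursors


ids = [f"i-{n}" for n in range(10)]
cursors = split_into_cursors(ids, roles=2, threads_per_role=2)
```

With ten interactions, two roles, and two threads per role, the four cursors hold 3, 3, 2, and 2 interactions, and every interaction belongs to exactly one cursor.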

Each processing worker retrieves interaction data from the xConnect Collection role and pulls additional data needed in the aggregation from other sources, for example, the Reference Data service.

When aggregation completes, the secondary xDB Reporting database contains the newly aggregated data. However, the primary xDB Reporting database always serves the reporting, insights, and analytics applications, so a system administrator must switch the secondary and primary xDB Reporting databases to make the new data live. System administrators can switch the databases manually, typically by updating the connection strings.
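Conceptually, the switch is an exchange of two connection string values. The sketch below models only that exchange: the `primary` and `secondary` key names and the connection string values are placeholders, not the names used in a Sitecore configuration file.

```python
from typing import Dict


def swap_reporting_databases(
    connection_strings: Dict[str, str],
    primary: str = "primary",       # placeholder key name
    secondary: str = "secondary",   # placeholder key name
) -> Dict[str, str]:
    """Return a copy in which the two reporting entries are exchanged."""
    swapped = dict(connection_strings)
    swapped[primary] = connection_strings[secondary]
    swapped[secondary] = connection_strings[primary]
    return swapped


before = {
    "primary": "Database=Reporting_A",
    "secondary": "Database=Reporting_B",
    "collection": "Database=Collection",
}
after = swap_reporting_databases(before)
```

Returning a new mapping rather than editing in place leaves the previous configuration available if the switch has to be rolled back.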

The last type of processing handled by the xDB Processing role is distributed processing.

Distributed processing allows systems to schedule xDB data processing tasks and distribute them to other databases. For example, the Path Analyzer uses distributed processing operations to process interactions and store aggregated traffic maps.

You can queue distributed processing operations using the xDB Processing API, for example, through a scheduled task that runs on the xDB Processing role.

When a distributed processing task is triggered, a task record is created in the xDB Processing Tasks database. The xDB Processing role performs data extraction to get an enumerator for the desired set of entities, such as interactions or contacts, in the xDB Collection database. You can limit the dataset based on a time range. The dataset is then split up into parts (or cursors) in the xDB Processing Tasks database. There is one cursor per thread for each xDB Processing role.
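The task setup described above, a time-range filter followed by one cursor per thread on each role, can be sketched as follows. All names (`Entity`, `ProcessingTask`, `create_task`) are hypothetical, and the round-robin assignment and the half-open time range are assumptions made for the example, not documented behavior.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass
class Entity:
    entity_id: str
    timestamp: datetime


@dataclass
class ProcessingTask:
    task_id: str
    # One cursor per (role name, thread index) slot.
    cursors: Dict[Tuple[str, int], List[str]] = field(default_factory=dict)


def create_task(
    task_id: str,
    entities: Iterable[Entity],
    roles: List[str],
    threads_per_role: int,
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
) -> ProcessingTask:
    """Filter entities to [start, end) and deal them out to worker slots."""
    slots = [(role, t) for role in roles for t in range(threads_per_role)]
    if not slots:
        raise ValueError("at least one role and one thread are required")
    selected = [
        e.entity_id
        for e in entities
        if (start is None or e.timestamp >= start)
        and (end is None or e.timestamp < end)
    ]
    task = ProcessingTask(task_id, {slot: [] for slot in slots})
    for index, entity_id in enumerate(selected):
        task.cursors[slots[index % len(slots)]].append(entity_id)
    return task


entities = [
    Entity("e1", datetime(2023, 1, 5)),
    Entity("e2", datetime(2023, 2, 10)),
    Entity("e3", datetime(2023, 2, 20)),
    Entity("e4", datetime(2023, 3, 1)),
]
task = create_task(
    "traffic-maps", entities, roles=["proc-a", "proc-b"], threads_per_role=1,
    start=datetime(2023, 2, 1), end=datetime(2023, 3, 1),
)
```

Only the two February entities fall inside the range, so each of the two roles receives a single-item cursor.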

The process of distributed processing.

Custom logic runs on all xDB Processing roles and processes the entities. During processing, the custom processing logic continually sends the processed data for storage or handling in other systems. For example, the specific aggregated data for the Path Analyzer is stored in separate tables in the xDB Reporting database.
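As an illustration of custom logic that sends partial results while it works through a cursor, the sketch below counts page-to-page transitions and flushes them to a sink in batches. Treating a traffic map as a table of transition counts is an assumption made for the example, and `process_cursor`, `sink`, and `batch_size` are hypothetical names.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple

Edge = Tuple[str, str]  # (from page, to page)


def process_cursor(
    interactions: Iterable[List[str]],
    sink: Callable[[Dict[Edge, int]], None],
    batch_size: int = 2,
) -> None:
    """Count page transitions and flush partial counts every batch."""
    batch: Counter = Counter()
    seen = 0
    for pages in interactions:
        # Each consecutive pair of pages is one transition.
        batch.update(zip(pages, pages[1:]))
        seen += 1
        if seen % batch_size == 0 and batch:
            sink(dict(batch))
            batch.clear()
    if batch:
        sink(dict(batch))   # flush whatever is left at the end


store: Counter = Counter()
flushes: List[Dict[Edge, int]] = []


def reporting_sink(partial: Dict[Edge, int]) -> None:
    """Stand-in for a write to separate reporting tables."""
    flushes.append(partial)
    store.update(partial)


process_cursor(
    [["/", "/a", "/b"], ["/", "/a"], ["/a", "/b"]],
    reporting_sink,
)
```

Flushing in batches means the sink receives data while processing is still under way, rather than in one write at the end.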

Processed data is continually sent for storage or handling in other systems.

Refer to the Architecture and roles documentation for privacy and security considerations for each role on the processing and aggregation data flow.