Accessing your data
This walkthrough describes how to access your organization's data in the Sitecore Amazon S3 bucket using the Amazon Web Services Command Line Interface (AWS CLI).
This walkthrough assumes that you have:
- An Amazon Web Services (AWS) account with access to the AWS Management Console and permission to create an IAM role.
- The AWS Command Line Interface (AWS CLI), configured to access your AWS instance using the IAM role.
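If you want to confirm that the AWS CLI is installed and that it is picking up the credentials or profile you expect, you can run the following standard commands. This is only a quick check, not a required step:
# Show the installed AWS CLI version
aws --version
# Show the account and identity the CLI is currently using
aws sts get-caller-identity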
To prepare to access your data, you first create an IAM role and update its policy. Next, you create a support case to request access to your data by authorizing the IAM role. After the IAM role is authorized, you can use it in the AWS CLI to securely access the data.
This walkthrough describes how to:
- Create an IAM role
- Configure the IAM role policy
- Request access
- Understand the exported data
- Access your data
Create an IAM role
You can use the AWS Management Console to create an IAM role that will grant you, as the creator, exclusive read access to your organization's data.
To create an IAM role:
- In the AWS Management Console, create an IAM role that will be authorized to access your organization's data in the Sitecore Amazon S3 bucket.
- Make a note of the IAM role Amazon Resource Name (ARN). The ARN has the format arn:aws:iam::<aws_account_id>:role/<role_name_with_path>. Replace <aws_account_id> with your AWS account ID and <role_name_with_path> with a valid path.
The IAM role you created grants exclusive read access only to you, the original user who created it. When requesting access to your organization's data, you must provide the specific ARN associated with this role.
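If you prefer to create the role from the command line instead of the console, the following is a minimal sketch. The role name SitecoreDataLakeReader and the file name trust-policy.json are illustrative choices, not names required by Sitecore, and <your_iam_user_name> is a placeholder for your own IAM user. The trust policy shown simply lets that user assume the role.
Contents of trust-policy.json:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::<aws_account_id>:user/<your_iam_user_name>" },
      "Action": "sts:AssumeRole"
    }
  ]
}
Create the role and print its ARN:
aws iam create-role --role-name SitecoreDataLakeReader --assume-role-policy-document file://trust-policy.json --query 'Role.Arn' --output text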
Configure the IAM role policy
After you create the IAM role, you must attach a permission policy to it. The permissions in this policy determine whether your request to access your organization's data is allowed or denied.
To configure the IAM role policy:
- In the AWS Management Console, in the access management area for the IAM role you created in the previous procedure, create the following inline policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowS3Access",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::bx-<client_key>-production-<region_code>/*",
        "arn:aws:s3:::bx-<client_key>-production-<region_code>"
      ]
    }
  ]
}
Replace the placeholder values with details from your Sitecore CDP instance.
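If you prefer the command line, you can attach the same inline policy with the AWS CLI. This sketch assumes the policy JSON above is saved as datalake-read-policy.json and that you used the illustrative role name SitecoreDataLakeReader; adjust both names to match your setup.
aws iam put-role-policy --role-name SitecoreDataLakeReader --policy-name AllowS3Access --policy-document file://datalake-read-policy.json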
Request access
After you configure the IAM role policy, you must request access to your organization's data by creating a support case to enable the data lake export.
To request access:
- Create a support case and provide your IAM role ARN.
- Wait for confirmation. You will be notified once Sitecore has enabled the data lake export, granting access to the created IAM role. Only the specific IAM role ARN you provided will be authorized to access your organization's data.
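The support case must contain the exact role ARN. If you did not note it when you created the role, you can retrieve it with the AWS CLI; the role name below is whatever name you chose when creating the role:
aws iam get-role --role-name <role_name> --query 'Role.Arn' --output text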
Do not modify the IAM role after the data lake export has been enabled, as any changes will disrupt access. The Sitecore Amazon S3 bucket is strictly configured to recognize only the original IAM role ARN that you provided. Changing the permissions or details of the IAM role will result in conflicts, preventing you from accessing your organization's data.
Understand the exported data
Before accessing your organization's data, it's important to understand where it's stored and what data is included in the export.
Data storage location
After access is granted, the Sitecore CDP data lake export service runs daily, creating a full export of your organization's data. The exported data is stored in a designated folder in the Sitecore Amazon S3 bucket. The export folder follows this format (with placeholder values replaced with details from your Sitecore CDP instance):
s3://bx-<client_key>-<env>-<region_code>/analytics/bdl/exports/data/
Data redundancy
Sitecore builds redundancy into the export folder in case of failures or errors in the export process. To ensure data reliability, Sitecore stores the last three days of full data lake exports in the designated folder.
For example, on May 5th 2024, you'll find three subfolders in the export folder, each labeled with a date in the YYYY-MM-DD ISO 8601 format and containing the following data:
- 2024-05-04 - the entire contents of the data lake, including new and updated data from May 3rd.
- 2024-05-03 - the entire contents of the data lake, including new and updated data from May 2nd.
- 2024-05-02 - the entire contents of the data lake.
If a failure occurs on May 6th, the previous three days' exports will still be available, but no 2024-05-05 export will be created. When the service resumes on May 7th, the folder structure will look like this:
- 2024-05-06 - the entire contents of the data lake, including new and updated data from May 4th and May 5th.
- 2024-05-04 - the entire contents of the data lake, including new and updated data from May 3rd.
- 2024-05-03 - the entire contents of the data lake.
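To check which daily export folders are currently available, you can list the export folder. This is an optional check and uses the same placeholders as the commands later in this walkthrough:
aws s3 ls s3://bx-<client_key>-<env>-<region_code>/analytics/bdl/exports/data/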
Data partitioning
The export folder is further partitioned into subfolders for different Sitecore CDP entities. You'll find separate folders for events, guests, and sessions.
The events subfolder contains the largest dataset in the data lake, and downloading this data daily can be inefficient. To optimize this process, Sitecore uses date-based partitioning for the events data. Each day's events are stored in a separate folder labeled meta_created_at_date=YYYY-MM-DD. For example: meta_created_at_date=2024-05-04, meta_created_at_date=2024-05-03, and so on.
The events data is additive, meaning previous events are not deleted. As a result, the events subfolder in each daily export contains partitioned subfolders for each day that the data lake export has been active. This structure makes it unnecessary to download the entire events subfolder every day. Instead, you only need to pull the latest day's partitioned events data and add it to your existing dataset to keep it updated.
The sessions subfolder is partitioned by date in the same way as the events subfolder, and it is recommended that you follow the same approach.
The guests subfolder is partitioned by guest type: CUSTOMER, VISITOR, and RETIRED. Because this dataset changes constantly, it is recommended that you download the full guests subfolder each day.
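As an illustration of this incremental approach, the following sketch copies only a single day's events partition instead of the whole events subfolder. The nested layout shown (a date folder containing events, guests, and sessions subfolders) is assumed from the description above, so verify the exact paths with a listing first:
# Inspect the entity subfolders inside a daily export
aws s3 ls s3://bx-<client_key>-<env>-<region_code>/analytics/bdl/exports/data/<date>/
# Copy only the newest events partition to your local machine
aws s3 cp s3://bx-<client_key>-<env>-<region_code>/analytics/bdl/exports/data/<date>/events/meta_created_at_date=<date>/ ./events/meta_created_at_date=<date>/ --recursive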
Access your data
After the IAM role is authorized, only you, the creator of the IAM role, will be able to access your organization's data in the Sitecore Amazon S3 bucket.
Only the specific IAM role ARN you provided is authorized to access your organization's data. No other users can access your data using this IAM role. If you grant the role to a different user than originally specified, and they attempt to carry out the export process, they will be denied for security reasons.
If you need to change the assigned user after the data lake export has been set up, you must create a support case to request this update. This will reset the process, so make sure you have the correct user, role, and ARN before requesting access.
This section describes how to use aws s3 cp (or copy) commands in the AWS Command Line Interface (AWS CLI) to download your data or copy it to another Amazon S3 bucket of your choice. Alternatively, you can perform any Amazon S3 action that starts with Get or List to access the data.
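For example, because the IAM role policy allows s3:ListBucket and s3:GetObject, lower-level s3api calls also work. The commands below are illustrative; <object_key> and <local_file_name> are placeholders for a real object key taken from the listing and a local file name of your choice:
# List objects under the export prefix (uses s3:ListBucket)
aws s3api list-objects-v2 --bucket bx-<client_key>-<env>-<region_code> --prefix analytics/bdl/exports/data/
# Download a single object (uses s3:GetObject)
aws s3api get-object --bucket bx-<client_key>-<env>-<region_code> --key <object_key> <local_file_name>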
To access your data:
- Make sure you have the AWS CLI installed and configured to access your AWS instance using the IAM role. (A sample profile configuration is shown after these steps.)
- Determine which folders and subfolders you want to copy from the export folder. You can select different sets of data depending on your specific requirements.
- Open a terminal or command prompt and run the following aws s3 cp commands to copy your organization's data. Make sure to replace the placeholder values with details from your Sitecore CDP instance.
Frequently used aws s3 cp commands you can enter in your terminal:
Download your organization's data to your local machine. This includes the last three days of full data lake exports.
aws s3 cp s3://bx-<client_key>-<env>-<region_code>/analytics/bdl/exports/data . --recursive
Download a full data lake export for a specific date to your local machine.
aws s3 cp s3://bx-<client_key>-<env>-<region_code>/analytics/bdl/exports/data/<date> . --recursive
Copy all your organization's data to another Amazon S3 bucket of your choice. This includes the last three days of full data lake exports.
aws s3 cp s3://bx-<client_key>-<env>-<region_code>/analytics/bdl/exports/data <destination> --recursive
Copy a full data lake export for a specific date to another Amazon S3 bucket of your choice.
aws s3 cp s3://bx-<client_key>-<env>-<region_code>/analytics/bdl/exports/data/<date> <destination> --recursive
After you run one of these commands, your organization's data is either downloaded locally or copied to another Amazon S3 bucket of your choice.
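If you still need to configure the AWS CLI to use the IAM role, one common approach is a named profile that assumes the role. The snippet below is a sketch of an ~/.aws/config entry; the profile name sitecore-datalake is illustrative, source_profile assumes you already have a default profile with your own credentials, and <aws_region> is a placeholder for the AWS Region you work in.
# ~/.aws/config (sketch)
[profile sitecore-datalake]
role_arn = arn:aws:iam::<aws_account_id>:role/<role_name_with_path>
source_profile = default
region = <aws_region>
You can then append --profile sitecore-datalake to the aws s3 cp commands above to run them with that role.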
Reference for placeholder values
In the example commands, replace the placeholder values with the required details from your Sitecore CDP instance and with export details depending on your specific needs.
| Attribute | Type | Description | Example |
| --- | --- | --- | --- |
| <client_key> | string | Your Sitecore CDP client key from your Sitecore CDP instance. This is your organization's unique and public identifier. You can find your client key in Sitecore CDP from the navigation pane. | |
| <env> | string | The deployment environment. Typically set to production. | production |
| <region_code> | string | The region code corresponding to your Sitecore CDP instance's environment. You can find the region code in Sitecore CDP from the navigation pane. | |
| <date> | string | A specific date in the past or today's date to copy a full data lake export for that date. Format: YYYY-MM-DD. | 2024-05-04 |
| <destination> | string | Your local machine, denoted by a period (.), or the path of another Amazon S3 bucket of your choice. | . |