Disaster Recovery for containers

Disaster recovery allows an organization to maintain or quickly resume mission-critical functions following a disaster that requires manual intervention. The Sitecore Disaster Recovery (DR) service offers two service types: DR Basic, where failover starts after customer confirmation, and DR Managed, where failover starts automatically. Each service type is paired with an infrastructure type: Cold Standby (minimum infrastructure) or Hot Standby (full replica):

  • DR Basic Cold Standby: reactive service. The recovery and failover process starts after the event and requires confirmation from the customer. This is a cost-effective service option with a longer Recovery Time Objective (RTO). The Basic Cold Standby DR service type provisions the minimum infrastructure required and is often used for non-critical applications or in situations where data changes infrequently.

    Basic Cold Standby Disaster Recovery includes Geo-Replication. In the event of a disaster, failover to the secondary region and database happens with minimal downtime.

  • DR Managed Hot Standby: proactive service. The failover and failback process starts automatically. The Managed Hot Standby DR service type provisions a full replica of the primary site, which provides the shortest RTO.

The recovery option that works best with your environment depends on whether you require failover initiation to be proactive or reactive and how quickly you want to be back online when an outage occurs.

For more information about disaster recovery for containers, see the Sitecore Managed Cloud Standard - Containers Disaster Recovery Knowledge Base article.

Infrastructure as Code

The Managed Cloud Containers environment uses Infrastructure as Code (IaC) where the provisioning artifacts are stored in Git repositories and follow GitOps best practices. Disaster Recovery also follows these practices.

For more information about the containers infrastructure, see The Managed Cloud architecture and Deploying in Managed Cloud.

Recovery options considerations

To decide which recovery option matches your requirements, use the following table as a reference and consider:

  • How quickly your site needs to be back online in the event of an outage.

  • The recovery point objective (RPO).

  • The recovery time objective (RTO).

| Specifications | DR Basic Cold Standby | DR Managed Hot Standby |
| --- | --- | --- |
| Backup technologies | Geo-Replication (ACR, SQL Server, KeyVault, Blob Storage); Azure APIs | Geo-Replication (ACR, SQL Server, KeyVault, Blob Storage); Azure APIs |
| Recovery process | 1. Customer request/approval for failover; 2. Deploy; 3. Switch over; 4. Go live; 5. Customer validation | 1. Switch over; 2. Go live; 3. Customer validation |
| Secondary environment state | Created on demand | Fully deployed exact replica of the primary environment, up and running |
| Recovery Point Objective (RPO) | SQL: 5 seconds. Applications rely on images in the ACR, so the RPO is based on the latest images available in the ACR. | SQL: 5 seconds. Applications rely on images in the ACR, so the RPO is based on the latest images available in the ACR. |
| Recovery Time Objective (RTO) - Technology only | Manual failover execution takes approximately 90 minutes and includes secondary Sitecore provisioning, AFD traffic switching, and SQL switchover. | Automated failover execution takes approximately 10 minutes and includes AFD traffic switching and SQL switchover. |
| Failback Time - Technology only | Manual failback execution takes approximately 15 minutes and includes AFD traffic switching and SQL switchover. | Automated failback execution takes approximately 10 minutes and includes AFD traffic switching and SQL switchover. |

Note

The technology RTO values depend on how long the system takes to restore the Sitecore platform. If manual steps are required involving the customer or partner, this may extend the effective RTO.
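
To make the comparison concrete, the following Python sketch encodes the approximate technology-only figures and recovery steps from the table above and checks which option fits a target RTO. It is an illustration only, not part of the Sitecore service: the names `DrOption`, `effective_rto`, and `options_meeting`, and the `manual_minutes` parameter, are hypothetical. The `manual_minutes` value stands in for the customer or partner steps that the note says may extend the effective RTO.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrOption:
    """One column of the comparison table (hypothetical structure)."""
    name: str
    failover_minutes: int          # approximate technology-only RTO
    failback_minutes: int          # approximate technology-only failback time
    needs_customer_approval: bool  # True when failover waits for confirmation
    recovery_steps: tuple


OPTIONS = (
    DrOption(
        name="DR Basic Cold Standby",
        failover_minutes=90,
        failback_minutes=15,
        needs_customer_approval=True,
        recovery_steps=(
            "Customer request/approval for failover",
            "Deploy",
            "Switch over",
            "Go live",
            "Customer validation",
        ),
    ),
    DrOption(
        name="DR Managed Hot Standby",
        failover_minutes=10,
        failback_minutes=10,
        needs_customer_approval=False,
        recovery_steps=("Switch over", "Go live", "Customer validation"),
    ),
)


def effective_rto(option, manual_minutes=0):
    """Technology-only RTO plus an assumed allowance for manual steps."""
    return option.failover_minutes + manual_minutes


def options_meeting(target_minutes, manual_minutes=0):
    """Names of the options whose effective RTO fits within the target."""
    return [
        option.name
        for option in OPTIONS
        if effective_rto(option, manual_minutes) <= target_minutes
    ]


if __name__ == "__main__":
    for target in (30, 120):
        print(target, options_meeting(target))
```

With a 120-minute target and no manual steps, both options qualify. Adding an assumed 45 minutes of manual work pushes the Cold Standby total to 135 minutes, so only the Hot Standby option remains.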

Replication between regions

A typical Sitecore environment consists of the following Azure resource types:

  • KeyVault - Azure provides read-only replication to the secondary environment, which allows Sitecore to continue operating there.

  • Storage Account - Object replication on the Storage account is used to selectively replicate storage account containers to the secondary environment.

  • ACR - ACR and its images are provisioned using Terraform. The infrastructure code in Git contains the latest target state and is used for provisioning or maintaining the secondary state.

  • AKS - AKS is provisioned using Terraform. The infrastructure code in Git contains the latest target state and is used for provisioning or maintaining the secondary state.

  • Azure SQL - Geo-Replication provided by Azure replicates the data to the secondary database.
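
The list above distinguishes resources whose data Azure replicates from resources that are rebuilt from the infrastructure code in Git. The following Python sketch groups them accordingly. It is an illustration only: the mechanism labels and function names are hypothetical identifiers chosen for this example, not Azure or Terraform API names.

```python
# Replication mechanism per resource type, transcribed from the list above.
# The mechanism labels are hypothetical identifiers, not Azure API values.
REPLICATION = {
    "KeyVault": "azure_read_only_replication",
    "Storage Account": "object_replication",
    "ACR": "terraform_from_git",
    "AKS": "terraform_from_git",
    "Azure SQL": "geo_replication",
}


def provisioned_from_git():
    """Resources whose secondary state comes from the IaC definitions in Git."""
    return sorted(
        name
        for name, mechanism in REPLICATION.items()
        if mechanism == "terraform_from_git"
    )


def replicated_by_azure():
    """Resources whose data is copied by an Azure replication feature."""
    return sorted(
        name
        for name, mechanism in REPLICATION.items()
        if mechanism != "terraform_from_git"
    )


if __name__ == "__main__":
    print("From Git via Terraform:", provisioned_from_git())
    print("Replicated by Azure:", replicated_by_azure())
```

Grouping the resources this way shows which parts of the secondary environment depend on the Git repositories being current and which depend on Azure replication.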
