Disaster Recovery for containers

Containers DR Basic Cold Standby

With Sitecore DR Basic Cold Standby, the Sitecore Managed Cloud disaster recovery service sets a recovery process in motion in the event of an outage. When there is an outage in the primary region, a new Sitecore production environment is created in a secondary data center. While the secondary environment is being created, a simple outage page is displayed to let customers know that the site is temporarily down. Because the new environment must be created in a secondary data center, this is the recovery option with the longer recovery time objective (RTO), but it is also the less expensive one.

The following diagram shows the containers infrastructure before the DR Basic setup.

Setup

The setup steps are:

Provision the Control Resource Group and the relevant underlying resources and services that monitor DR states.
Configure Azure Front Door.
Set up geo-replication for SQL Server, Azure Container Registry (ACR), and the Storage Account.
Update the Application Repository so that it recognizes the SQL geo-replication endpoint.
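
As an illustration of the last setup step, the following minimal sketch rewrites the server part of a connection string so that it targets a geo-replication endpoint. It assumes that the application configuration stores an ADO.NET-style connection string; the function name `with_sql_endpoint` and the host names are hypothetical and are not part of the Sitecore setup procedure.

```python
def with_sql_endpoint(connection_string: str, endpoint: str) -> str:
    """Return a copy of an ADO.NET-style connection string that targets `endpoint`.

    Assumption: the server is given by a `Server` or `Data Source` key and
    no value contains a semicolon.
    """
    parts = []
    replaced = False
    for part in connection_string.split(";"):
        if not part.strip():
            continue  # skip empty segments such as a trailing ";"
        key, _, _value = part.partition("=")
        if key.strip().lower() in ("server", "data source"):
            # Point the connection at the geo-replication endpoint.
            parts.append(f"{key.strip()}={endpoint}")
            replaced = True
        else:
            parts.append(part.strip())
    if not replaced:
        raise ValueError("connection string has no Server or Data Source key")
    return ";".join(parts)
```

All other keys, such as the database name and credentials, are left unchanged, so only the endpoint differs between the original and the updated configuration.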

The following diagram shows the infrastructure state of the containers after performing the DR Basic setup.

Note

Sitecore executes the setup after the customer initiates a service request.

Initiating a failover

Sitecore Managed Cloud continuously checks the health of the primary region environment. If three out of five data centers report an issue, the Sitecore Managed Cloud operations team investigates the Sitecore environment in the primary data center to determine whether the issue is genuine or a false positive. The operations team performs the following validation checks in the primary data center:

Check for alerts raised by the Azure Resources used by the Sitecore site.
Check if the Traffic Manager is reporting a degraded endpoint.
Check the Azure Status site for known data center issues.

If the Managed Cloud Operations team determines that there is an unrecoverable issue in part or all of the underlying infrastructure in the primary data center, the failover confirmation process begins and the customer is contacted.
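
The three-out-of-five rule that starts an investigation can be sketched as a simple quorum check. This is only an illustration: the `ProbeResult` type and the function name are hypothetical, and the sketch assumes that "three out of five" means at least three of exactly five reporting locations.

```python
from dataclasses import dataclass
from typing import Sequence

PROBE_COUNT = 5        # from the text: five data centers report health
REQUIRED_FAILURES = 3  # assumption: "three out of five" means at least three


@dataclass(frozen=True)
class ProbeResult:
    location: str  # hypothetical name of the reporting data center
    healthy: bool  # True if this location sees the primary region as healthy


def needs_investigation(results: Sequence[ProbeResult]) -> bool:
    """Return True when enough locations report an issue to start an investigation."""
    if len(results) != PROBE_COUNT:
        raise ValueError(f"expected {PROBE_COUNT} probe results, got {len(results)}")
    failures = sum(1 for r in results if not r.healthy)
    return failures >= REQUIRED_FAILURES
```

A positive result only starts the investigation. Whether to proceed to failover confirmation remains a decision for the operations team after the validation checks listed above.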

During a disaster, the primary resources are unavailable and Azure Front Door serves users an outage page. The following image shows the state during a disaster.

The following image shows the infrastructure of the containers after performing DR Basic failover.

Failover/recovery confirmation

After the customer confirms, Sitecore triggers the recovery procedure, which consists of the following steps:

Provision the infrastructure in the secondary region.
Provision the application in the secondary region.
Switch Azure Front Door (AFD) to route traffic to the secondary region.
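
The ordering of the recovery steps can be sketched as a small runner that does nothing without customer confirmation and stops at the first failing step, so that traffic is never switched to an environment that was not provisioned. The function `run_failover` and the step callables are hypothetical placeholders, not Sitecore tooling.

```python
from typing import Callable, List, Tuple

# A step is a (name, action) pair; the actions here are hypothetical placeholders.
Step = Tuple[str, Callable[[], None]]


def run_failover(customer_confirmed: bool, steps: List[Step]) -> List[str]:
    """Run the recovery steps in order and return the names of the completed steps.

    Nothing runs without customer confirmation. An exception raised by a step
    propagates to the caller, so later steps do not run.
    """
    if not customer_confirmed:
        return []
    completed: List[str] = []
    for name, action in steps:
        action()
        completed.append(name)
    return completed


# Placeholder steps in the order given above.
RECOVERY_STEPS: List[Step] = [
    ("provision infrastructure in secondary", lambda: None),
    ("provision application in secondary", lambda: None),
    ("switch AFD traffic to secondary", lambda: None),
]
```

Calling `run_failover(True, RECOVERY_STEPS)` runs the three placeholder steps in sequence, while `run_failover(False, RECOVERY_STEPS)` returns an empty list without running any of them.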

Failback

After Sitecore has finished the failover process and the cause of the disaster has been fixed, the customer and the Managed Cloud Operations team agree on a time to return to the primary region environment. After a failback, the primary environment resumes from its state before the failure, and any SQL Server data from the secondary region is replicated to it.
