Disaster Recovery for PaaS

DR Basic Cold Standby


This topic describes DR for PaaS. If you are looking for DR for Containers, refer to the Containers DR Basic Cold Standby topic instead.

With Sitecore DR Basic Cold Standby, the Sitecore Managed Cloud disaster recovery service sets a process into action in the event of an outage. When there is an outage in the primary region, a new Sitecore production environment is created in a secondary data center. While the secondary environment is being created, a simple outage page is displayed to let customers know that the site is temporarily down. Because the new environment must be created from scratch in the secondary data center, this option has a longer recovery time objective (RTO) than other DR options, but it is also less expensive.
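The lifecycle described above can be sketched as a small state machine. The state and event names below are hypothetical labels chosen for illustration, not Sitecore or Azure identifiers, and failback is simplified to a single transition because this topic does not say whether the outage page is shown during failback.

```python
from enum import Enum, auto


class DRState(Enum):
    """Hypothetical states a site passes through under cold standby."""
    PRIMARY_ACTIVE = auto()    # primary region serves traffic
    OUTAGE_PAGE = auto()       # outage page shown while the secondary is built
    SECONDARY_ACTIVE = auto()  # new environment in the secondary region serves traffic


# Allowed transitions, keyed by (current state, event name).
# Event names are illustrative only.
TRANSITIONS = {
    (DRState.PRIMARY_ACTIVE, "outage_confirmed"): DRState.OUTAGE_PAGE,
    (DRState.OUTAGE_PAGE, "secondary_ready"): DRState.SECONDARY_ACTIVE,
    (DRState.SECONDARY_ACTIVE, "failback_complete"): DRState.PRIMARY_ACTIVE,
}


def next_state(state: DRState, event: str) -> DRState:
    """Return the state after `event`, or raise if the transition is not allowed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in state {state.name}") from None


state = DRState.PRIMARY_ACTIVE
for event in ["outage_confirmed", "secondary_ready"]:
    state = next_state(state, event)
print(state.name)  # SECONDARY_ACTIVE
```

Modeling the transitions explicitly makes it clear that the secondary environment never serves traffic before it is ready; until then, visitors see the outage page.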

Set up

The setup steps are:

Provision the Control Resource Group and the relevant underlying resources and services that monitor DR states.
Set up replication for SQL Server.
Provision backup automation (this backs up the web apps and synchronizes the SQL Server databases in the failover group).
Set up the outage page.
Set up Traffic Manager to redirect traffic and switch between the primary CD instance and the outage page.
Set up email alerts to notify the Managed Cloud operations team when availability tests fail.
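The switch between the primary CD instance and the outage page can be pictured as priority-based routing. The sketch below assumes a priority-style method, where traffic goes to the healthy endpoint with the lowest priority number. Azure Traffic Manager offers such a method, but this topic does not say which routing method Managed Cloud configures. The endpoint names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Endpoint:
    name: str       # hypothetical label, e.g. "primary-cd"
    priority: int   # lower number = preferred (priority-routing convention)
    healthy: bool   # result of the most recent health probe


def select_endpoint(endpoints: list[Endpoint]) -> Optional[Endpoint]:
    """Return the healthy endpoint with the lowest priority number, or None."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda e: e.priority)


endpoints = [
    Endpoint("primary-cd", priority=1, healthy=False),  # primary CD is down
    Endpoint("outage-page", priority=2, healthy=True),  # static outage page
]
print(select_endpoint(endpoints).name)  # outage-page
```

Under this assumption, no manual switch is needed to show the outage page: as soon as the primary CD fails its health probe, the next healthy endpoint in priority order receives the traffic.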

Initiating a Failover

Sitecore Managed Cloud continuously checks the health of the primary region environment. If three out of five data centers report an issue, the Sitecore Managed Cloud operations team investigates the Sitecore environment in the primary data center to confirm that the issue is legitimate and not a false positive. The operations team performs the following validation checks in the primary data center:

Check for alerts raised by the Azure resources used by the Sitecore site.
Check whether Traffic Manager is reporting a degraded endpoint.
Check the Azure Status site for known data center issues.
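The trigger for an investigation is effectively a quorum over availability-test locations. A minimal sketch, assuming each location reports a simple pass/fail result; the location names and the function name are hypothetical, and the validation checks themselves are performed by the operations team, not by code.

```python
FAILURE_THRESHOLD = 3  # "three out of five" locations, as described above
TOTAL_LOCATIONS = 5

VALIDATION_CHECKS = (
    "Alerts raised by the Azure resources used by the Sitecore site",
    "Degraded endpoint reported by Traffic Manager",
    "Known data center issues on the Azure Status site",
)


def should_investigate(location_results: dict[str, bool]) -> bool:
    """True when at least FAILURE_THRESHOLD locations report a failed test.

    `location_results` maps a (hypothetical) test location name to True if
    that location's availability test passed.
    """
    if len(location_results) != TOTAL_LOCATIONS:
        raise ValueError(
            f"expected {TOTAL_LOCATIONS} locations, got {len(location_results)}"
        )
    failures = sum(1 for passed in location_results.values() if not passed)
    return failures >= FAILURE_THRESHOLD


results = {"loc-1": False, "loc-2": False, "loc-3": False, "loc-4": True, "loc-5": True}
if should_investigate(results):
    for check in VALIDATION_CHECKS:
        print("Validate:", check)
```

Requiring a majority of locations to fail, rather than a single one, filters out problems local to one probe location; the manual checks then guard against the remaining false positives.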

If the Managed Cloud operations team determines that there is an unrecoverable issue in part or all of the underlying infrastructure in the primary data center, the failover confirmation process begins and the customer is contacted.

Failover/recovery confirmation

When the customer confirms the failover, Sitecore triggers the recovery procedure using the following steps:

Deploy a new Sitecore environment in the secondary region.
Restore the web apps from the last backup.
Switch over SQL Server geo-replication to the secondary region.
Update the connection strings with the credentials from the primary instance.
Rebuild the content and xDB indexes.
Switch the Traffic Manager to the secondary Sitecore instance.
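The recovery steps are strictly ordered: for example, the connection strings can only be updated after the SQL switchover, and Traffic Manager should only point at the secondary instance once indexing is done. A minimal runbook sketch, assuming each step is a callable that raises on failure; the step actions here are hypothetical placeholders, not calls to Sitecore or Azure tooling.

```python
from typing import Callable

# Each step is (description, action). Actions are placeholders; a real
# runbook would invoke deployment and Azure tooling here.
Step = tuple[str, Callable[[], None]]


def run_recovery(steps: list[Step], customer_confirmed: bool) -> list[str]:
    """Run steps in order and return their descriptions once all have completed.

    Nothing runs without customer confirmation. If a step raises, the
    exception propagates and later steps never run, so traffic is not
    switched to a half-built environment.
    """
    if not customer_confirmed:
        raise PermissionError("failover requires customer confirmation")
    completed: list[str] = []
    for description, action in steps:
        action()
        completed.append(description)
    return completed


def noop() -> None:
    """Hypothetical stand-in for a real recovery action."""


RECOVERY_STEPS: list[Step] = [
    ("Deploy a new Sitecore environment in the secondary region", noop),
    ("Restore the web apps from the last backup", noop),
    ("Switch over SQL Server geo-replication", noop),
    ("Update the connection strings", noop),
    ("Rebuild the content and xDB indexes", noop),
    ("Switch Traffic Manager to the secondary instance", noop),
]

for done in run_recovery(RECOVERY_STEPS, customer_confirmed=True):
    print("Done:", done)
```

Putting the Traffic Manager switch last means visitors keep seeing the outage page until every earlier step has succeeded.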

Failback

After Sitecore has finished the failover process and the cause of the disaster has been fixed, the customer and the Managed Cloud Operations team will agree on a time to return to the primary region environment. Because the data in the primary data center is now stale, the failover steps must be repeated from secondary to primary to bring the data up to date.
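Because failback repeats the failover steps in the opposite direction, the plan can be expressed once and parameterized by source and target. This sketch assumes every failover step applies unchanged in reverse, which the text implies but does not spell out; the region labels and the function name are hypothetical.

```python
def failover_plan(source: str, target: str) -> list[str]:
    """Return the ordered steps for moving production from `source` to `target`.

    Failback is the same plan with the arguments swapped, which is what
    brings the stale region back up to date.
    """
    return [
        f"Deploy a Sitecore environment in {target}",
        f"Restore the web apps in {target} from the last backup taken in {source}",
        f"Switch over SQL Server geo-replication from {source} to {target}",
        f"Update connection strings in {target}",
        f"Rebuild content and xDB indexes in {target}",
        f"Switch Traffic Manager from {source} to {target}",
    ]


failover = failover_plan("primary", "secondary")
failback = failover_plan("secondary", "primary")
print(failback[2])  # Switch over SQL Server geo-replication from secondary to primary
```

Defining the plan once keeps failover and failback in step with each other: any change to the procedure automatically applies in both directions.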
