Disaster Recovery for PaaS 2.0

Disaster Recovery 2.0 considerations

This section highlights what you need to consider when building a Managed Cloud PaaS 2.0 Sitecore solution with disaster recovery in mind.

Unsupported scenarios

A failover will not resolve failures caused by site misconfiguration or application code issues, because the configuration is replicated exactly to the secondary site.

In addition, it's not possible to support disaster recovery if you are using P0 V3 App Service SKUs in Production. This is because the P0 V3 SKUs do not support zone redundancy, and also because they do not support scaling either to or from the P0 V3 tier. The Sitecore Disaster Recovery process includes tuned-down App Service plans, so we cannot enable the P0 V3 SKU in the secondary region.

There's also a small set of scenarios where restoring a production site into the Secondary Region might not be possible. Some examples of these are:

If there's an issue with a global Azure service such as authentication or Azure Front Door.
If both the primary and the secondary (backup) regions are down simultaneously.
If there's a large-scale global network failure or outage.
If the network connection between the Primary and Secondary Region is down.

Failback considerations

This section describes scenarios that can occur after failover and that might affect the ability to move on to the failback stage.

In all scenarios, it's important to regularly review and update your disaster recovery plan to adapt to changing circumstances and technologies. This helps ensure that your organization is well-prepared to handle unexpected disruptions and minimize the impact on your operations.

Primary region has recovered and Azure services are available without data loss or corruption

In this scenario, the primary region, which experienced a disaster or outage, has successfully recovered. As a result, all Azure services in the primary region are back online and functioning correctly. This is an ideal outcome as it ensures business continuity without any data loss or corruption.

Considerations:

Review the root cause analysis of the disaster or outage to understand what happened and prevent it from occurring again in the future.
Ensure that monitoring and alerting systems are in place to detect early signs of potential issues.

Primary region has recovered, and Azure services are available but the Azure service has sustained data loss or corruption

In this scenario, the primary region has recovered, and Azure services are back online. However, there has been data loss or data corruption. This is a less desirable outcome as it may lead to issues such as data inconsistency or incomplete transactions.

Considerations:

Identify the extent of data loss or corruption and assess its impact on the business operations.
Implement data recovery procedures to restore lost or corrupted data from backups or secondary sources. Sitecore Managed Cloud includes the option to refresh the data from the DR region back to the Primary region when performing a failback.
Investigate the cause of data loss or corruption and take steps to prevent it from happening in the future.

Primary region has not recovered and an alternative DR region should be nominated

In this scenario, the primary region has not yet recovered, and it might be facing prolonged downtime or irreparable issues. As a result, it is necessary to designate an alternative disaster recovery (DR) region to ensure business continuity.

Considerations:

Determine the criteria for selecting the new DR region, which might include factors such as geographic proximity, redundancy of infrastructure, and compliance requirements.
Update your disaster recovery plan to reflect the new DR region, including the necessary failover procedures and data replication strategies.
Communicate the change in the DR strategy to relevant stakeholders and ensure they are aware of the new DR region and procedures.

Third-party service APIs

You are responsible for all third-party service APIs.

Sitecore connection strings

During the initial DR enablement process, Sitecore updates connection strings to ensure they are correctly pointed to the associated Disaster Recovery resources. This is necessary to ensure the failover from the Primary to the Secondary region is successful. However, whilst the DR process includes continuous replication of the SQL Database and periodic backups and restores of the App Services, the files noted in the table below are excluded from the ongoing App Service synchronization between the Primary and Secondary sites. As such, if any updates are made to these connection strings (for example, if you add new connection strings or credentials for custom applications), these changes must be manually applied to the DR site.

The following connection strings are excluded from the ongoing backup and restore processes:

Connection string	Files
"si"	"Config/production/Sitecore.IdentityServer.Host.xml", "sitecorehost.xml"
"cm"	"App_Config/ConnectionStrings.config", "App_Config/Sitecore/Azure/Sitecore.Xdb.Remote.Client.CM.config"
"cortex-processing"	"App_Config/ConnectionStrings.config", "App_Data/jobs/continuous/ProcessingEngine/App_Config/ConnectionStrings.config", "App_Config/AppSettings.config", "App_Data/jobs/continuous/ProcessingEngine/App_Config/AppSettings.config"
"cortex-reporting"	"App_Config/ConnectionStrings.config", "App_Config/AppSettings.config"
"ma-ops"	"App_Config/ConnectionStrings.config", "App_Data/jobs/continuous/AutomationEngine/App_Config/ConnectionStrings.config", "App_Config/AppSettings.config", "App_Data/jobs/continuous/AutomationEngine/App_Config/AppSettings.config"
"ma-rep"	"App_Config/ConnectionStrings.config", "App_Config/AppSettings.config"
"xc-collect"	"App_Config/ConnectionStrings.config", "App_Config/AppSettings.config"
"xc-refdata"	"App_Config/ConnectionStrings.config", "App_Config/AppSettings.config"
"xc-search"	"App_Config/ConnectionStrings.config", "App_Data/jobs/continuous/IndexWorker/App_Config/ConnectionStrings.config", "App_Config/AppSettings.config", "App_Data/jobs/continuous/IndexWorker/App_Config/AppSettings.config"
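Because these files are not synchronized, a custom connection string added on the Primary site can silently go missing on the DR site. The following Python sketch is one possible way to spot that drift. It is not part of the Managed Cloud tooling: the function names and sample values are hypothetical, and it assumes you have already obtained the text of the same ConnectionStrings.config file from both sites and that the file uses the standard <connectionStrings> layout with <add name="..." connectionString="..." /> entries. Only the names are compared, since the values are expected to differ between regions.

```python
import xml.etree.ElementTree as ET


def read_connection_strings(xml_text):
    """Return {name: connectionString} parsed from a ConnectionStrings.config document."""
    root = ET.fromstring(xml_text)
    return {
        node.get("name"): node.get("connectionString", "")
        for node in root.iter("add")
        if node.get("name")
    }


def diff_connection_strings(primary_xml, dr_xml):
    """Compare entry names only; values legitimately point at different resources."""
    primary = read_connection_strings(primary_xml)
    dr = read_connection_strings(dr_xml)
    return {
        "missing_in_dr": sorted(primary.keys() - dr.keys()),
        "only_in_dr": sorted(dr.keys() - primary.keys()),
    }


# Hypothetical sample content; real files hold full connection strings.
PRIMARY_SAMPLE = """<connectionStrings>
  <add name="core" connectionString="primary-core" />
  <add name="custom.api" connectionString="primary-custom" />
</connectionStrings>"""

DR_SAMPLE = """<connectionStrings>
  <add name="core" connectionString="dr-core" />
</connectionStrings>"""

if __name__ == "__main__":
    print(diff_connection_strings(PRIMARY_SAMPLE, DR_SAMPLE))
```

Any name reported under missing_in_dr is a candidate for the manual update described above.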

Outage page

Managed Cloud DR 2.0 includes a default outage page. This outage page is only shown across public-facing endpoints during the failover process. You can change this page at your discretion.

Certificates

The creation of the Azure certificates is part of the Managed Cloud PaaS 2.0 provisioning pipeline (Spoke network provisioning). The same certificate must be used for both the Primary and Disaster Recovery Spokes.

xConnect Search Indexer

Sitecore can only have one active xConnect Search Indexer web job running across a solution. This means that the Production indexer must be shut down during failover and restoration of service, to remove the risk of it running at the same time as the indexer on the DR environment.

Certificates in Azure

A full trust chain certificate for the customer website domain must be provided and installed. Only one website certificate is currently supported with Managed Cloud DR. If you have multiple domains, use a wildcard certificate.
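When planning a wildcard certificate, keep in mind that a wildcard name matches exactly one leftmost label: *.example.com covers www.example.com, but not example.com itself and not shop.eu.example.com. The following Python sketch illustrates that rule for a list of site domains. The function names and domains are hypothetical, and the check only compares names; it does not validate a real certificate or its trust chain.

```python
def wildcard_covers(pattern, hostname):
    """Return True if a certificate name (plain or '*.'-prefixed) covers the hostname."""
    pattern = pattern.lower().rstrip(".")
    hostname = hostname.lower().rstrip(".")
    if not pattern.startswith("*."):
        # A non-wildcard name must match the hostname exactly.
        return pattern == hostname
    pattern_labels = pattern.split(".")
    host_labels = hostname.split(".")
    # The wildcard stands for exactly one non-empty leftmost label.
    return (
        len(pattern_labels) == len(host_labels)
        and host_labels[0] != ""
        and pattern_labels[1:] == host_labels[1:]
    )


def uncovered_domains(pattern, domains):
    """List the domains that the given certificate name would not cover."""
    return [d for d in domains if not wildcard_covers(pattern, d)]


if __name__ == "__main__":
    # Hypothetical domains for illustration only.
    sites = ["www.example.com", "example.com", "shop.eu.example.com"]
    print(uncovered_domains("*.example.com", sites))
```

Domains reported as uncovered would need to be restructured under a single wildcard level, because only one website certificate is supported.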

DNS Configuration

Specific DNS entries must be added in your public DNS provider for your website domain. The exact entries are environment/deployment driven and will be specific to your deployment. If these are not correctly applied, the site will not load.
