Sitecore Experience Manager

The HADR hot-hot process

Abstract

Learn about the process involved in recovering your data with HADR hot-hot.

Sitecore Managed Cloud uses the Sitecore Disaster Recovery service to offer three options for maintaining a high availability disaster recovery (HADR) service. This topic describes the HADR hot-hot option.

Just like the HADR hot-warm option, the HADR hot-hot option always includes a secondary data center with a fully furnished Sitecore environment set up. The Sitecore disaster recovery solution leverages Azure datacenter regions, so that if a disaster occurs in the region containing your production environment (the primary region), HADR hot-hot can recover your primary environment into another region (the secondary region). Ensure that all Azure resources in your secondary data center match the sizes and instance counts of those in your primary data center. As a result, the hot-hot recovery option has a shorter recovery time objective (RTO), because everything is already prepared.

(Diagram: the HADR hot-hot and hot-warm options)

To set up your disaster recovery process:

  • Ensure you are running your Sitecore solution (9.1 or later), on Azure.

    Note

    Sitecore supports the following topologies: XM and XP, and the following deployment sizes: Extra Small, Small, Medium, Large, Extra Large.

  • Install PowerShell with an Azure SDK, version 6.0.0 or later.

  • Run your setup script and use the relevant modules.

  • Use PowerShell (AzureRM) to log in to Azure and select the relevant subscription (see the example after this list).

  • Have your Sitecore license file ready.

  • Have any scripts that you want to develop for your HADR hot-hot scenario ready.

  • Use the Sitecore Azure Toolkit.
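
The following is a minimal sketch of the sign-in step using the AzureRM module; the subscription ID is a placeholder that you replace with your own.

    # Load the AzureRM module and sign in interactively.
    Import-Module AzureRM
    Connect-AzureRmAccount

    # Select the subscription that hosts (or will host) your Sitecore resources.
    Select-AzureRmSubscription -SubscriptionId "00000000-0000-0000-0000-000000000000"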

In the event of an outage with the HADR hot-hot option, you must set in motion a recovery process similar to the following:

  1. Enable recovery during a disaster by setting up the necessary environment before the disaster happens.

  2. Initiate the backup and replication process so that the Sitecore deployment and data are available for a healthy recovery.

  3. Deploy a passive Sitecore solution into the secondary region.

  4. Set up and enable the traffic manager to be the public gateway.

The following stages outline the setup script process.

Stage 1: Prepare to name resources, execute modules, and upload files to be used.

You must construct resource names; prepare PowerShell modules, HTML files, and web certificate(s); and create the DR-related modules that you will use when setting up and failing over.

Stage 2: Prepare to manage the resources in a specific subscription.

Log into Azure and select the subscription that you want to apply HADR hot-hot to.

Stage 3: Choose another region that supports Azure Monitoring and is not currently affected by a disaster.

Create the following in that region (see the sketch after this list):

  • A control resource group in Azure (so you can add your resources to this group).

  • An Azure key vault (to manage secrets).

  • A storage account (to store your backups).

  • Application Insights (for troubleshooting).

  • A traffic manager (for load balancing).
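
The following is a minimal sketch, using AzureRM cmdlets, of creating the control resources in a secondary region. The resource names and the region are placeholders; use your own naming convention.

    $region = "West Europe"
    $rgName = "my-dr-control-rg"

    # Control resource group that holds the DR control resources.
    New-AzureRmResourceGroup -Name $rgName -Location $region

    # Key vault for secrets (certificates and connection strings).
    New-AzureRmKeyVault -VaultName "my-dr-keyvault" -ResourceGroupName $rgName -Location $region

    # Storage account for backups.
    New-AzureRmStorageAccount -ResourceGroupName $rgName -Name "mydrbackupstorage" `
        -Location $region -SkuName "Standard_LRS"

    # Application Insights for troubleshooting and availability tests.
    New-AzureRmApplicationInsights -ResourceGroupName $rgName -Name "my-dr-appinsights" -Location $region

    # Traffic manager profile that later becomes the public gateway (priority routing).
    New-AzureRmTrafficManagerProfile -Name "my-dr-tm" -ResourceGroupName $rgName `
        -TrafficRoutingMethod Priority -RelativeDnsName "my-dr-tm-dns" -Ttl 30 `
        -MonitorProtocol HTTPS -MonitorPort 443 -MonitorPath "/"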

Stage 4: Use the Azure key vault (a service that manages and stores secrets) to back up your certificate and connection strings for your secondary deployment.

Back up certificate(s) and connection strings by using a PowerShell script to fetch the web application certificate and connection strings from the web applications in the primary resource group, and then store them in the Azure key vault.
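
As an illustration, the following sketch reads the connection strings from a primary web application and stores them as key vault secrets. The resource group, web application, and vault names are placeholders.

    $primaryRg  = "my-primary-rg"
    $webAppName = "my-sitecore-cd"
    $vaultName  = "my-dr-keyvault"

    # Fetch the web application and its connection strings from the primary resource group.
    $webApp = Get-AzureRmWebApp -ResourceGroupName $primaryRg -Name $webAppName

    # Store each connection string as a key vault secret (secret names allow only
    # alphanumerics and dashes, so other characters are replaced).
    foreach ($conn in $webApp.SiteConfig.ConnectionStrings) {
        $secretName  = ("$webAppName-$($conn.Name)") -replace '[^0-9a-zA-Z-]', '-'
        $secretValue = ConvertTo-SecureString -String $conn.ConnectionString -AsPlainText -Force
        Set-AzureKeyVaultSecret -VaultName $vaultName -Name $secretName -SecretValue $secretValue
    }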

Stage 5: Monitor the health of the primary Content Delivery (CD) server, and set up notifications for when the CD server is not responding.

Set up availability tests in App Insights to detect when the CD server stops responding.

Stage 6: Prepare the secondary data center.

Provision Sitecore in the secondary data center by deploying Sitecore in the secondary region.
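
A minimal deployment sketch using the Sitecore Azure Toolkit is shown below. The module path, ARM template URL, parameters file, and deployment name are placeholders for your own values.

    # Load the Sitecore Azure Toolkit cmdlets.
    Import-Module .\tools\Sitecore.Cloud.Cmdlets.psm1

    # Deploy a passive Sitecore environment into the secondary region.
    Start-SitecoreAzureDeployment -Location "West Europe" `
        -Name "my-sitecore-secondary" `
        -ArmTemplateUrl "https://example.com/azuredeploy.json" `
        -ArmParametersPath ".\azuredeploy.parameters.json" `
        -LicenseXmlPath ".\license.xml"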

Stage 7: Set up Geo-replication using a failover group where the replicated databases reside in the secondary region.

Create and enable Geo-replication.
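
The following is a minimal sketch, using AzureRM cmdlets, of creating the failover group and adding the Sitecore databases to it. The server, database, and group names are placeholders.

    $primaryRg = "my-primary-rg"

    # Create a failover group that replicates to a server in the secondary region.
    New-AzureRmSqlDatabaseFailoverGroup -ResourceGroupName $primaryRg `
        -ServerName "my-primary-sql" -PartnerServerName "my-secondary-sql" `
        -FailoverGroupName "my-sitecore-fg" -FailoverPolicy Automatic

    # Add the Sitecore databases (everything except the logical master database) to the group.
    $databases = Get-AzureRmSqlDatabase -ResourceGroupName $primaryRg -ServerName "my-primary-sql" |
        Where-Object { $_.DatabaseName -ne "master" }
    Add-AzureRmSqlDatabaseToFailoverGroup -ResourceGroupName $primaryRg `
        -ServerName "my-primary-sql" -FailoverGroupName "my-sitecore-fg" -Database $databases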

Stage 8: Use hotfixes.

If you have any hotfixes, then deploy them to the folder on the primary and secondary servers.

If you are using Azure Search, then you must request a hotfix from Sitecore Support to enable indexing to work with the piped connection string used for the content indexes.

Stage 9: Set up search replication.

Note

This only applies to Azure Search.

Update your Azure Search service connection strings to support a geo-replicated scenario.

Stage 10: Change your connection string to a read-write listener.

Update your database connection string configuration for the resources in your primary resource group to use a failover group read-write endpoint.

Update your database connection string configuration for resources in your secondary resource group to use a failover group read-write endpoint.

Use the ConnectionStrings.config file for Web Apps to update web jobs.
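
As a sketch, the following updates a web application's connection strings to point at the failover group read-write listener. The failover group, web application, database, and credential values are placeholders, and the hashtable must contain the complete set of connection strings because the cmdlet replaces the existing collection.

    # Read-write listener endpoint of the failover group.
    $listener = "my-sitecore-fg.database.windows.net"

    # Note: Set-AzureRmWebApp replaces the full connection string collection,
    # so include every connection string the web application needs.
    $connStrings = @{
        core = @{
            Type  = "SQLAzure"
            Value = "Data Source=tcp:$listener,1433;Initial Catalog=my-sitecore-core-db;User ID=coreuser;Password=<password>"
        }
    }

    Set-AzureRmWebApp -ResourceGroupName "my-primary-rg" -Name "my-sitecore-cd" -ConnectionStrings $connStrings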

Stage 11: Create a configuration patch file for your secondary region.

Create configuration patch files for web applications in your secondary resource group to allow configuration value patching for configuration files that have been restored from your primary resource group.

Stage 12: Configure your traffic manager endpoints.

Set up your traffic manager with the primary CD web application and the secondary CD web application as endpoints. Use the priority routing method, configured to run fast health checks on the CD web applications at 10-second intervals. If the traffic manager fails to receive response code 200 three times in a row, the website is considered degraded and the traffic manager fails over to the secondary endpoint.
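
The following is a minimal sketch, using AzureRM cmdlets, of configuring fast health checks on the profile and adding the two CD web applications as priority endpoints. Profile, resource group, and web application names are placeholders.

    # Configure priority routing with fast health checks (10-second interval,
    # three tolerated failures).
    $tmProfile = Get-AzureRmTrafficManagerProfile -Name "my-dr-tm" -ResourceGroupName "my-dr-control-rg"
    $tmProfile.TrafficRoutingMethod = "Priority"
    $tmProfile.MonitorIntervalInSeconds = 10
    $tmProfile.MonitorTimeoutInSeconds = 9
    $tmProfile.MonitorToleratedNumberOfFailures = 3
    Set-AzureRmTrafficManagerProfile -TrafficManagerProfile $tmProfile

    # Priority 1 = primary CD web application, priority 2 = secondary CD web application.
    $primaryCd   = Get-AzureRmWebApp -ResourceGroupName "my-primary-rg" -Name "my-sitecore-cd"
    $secondaryCd = Get-AzureRmWebApp -ResourceGroupName "my-secondary-rg" -Name "my-sitecore-cd-dr"

    New-AzureRmTrafficManagerEndpoint -Name "primary-cd" -ProfileName "my-dr-tm" `
        -ResourceGroupName "my-dr-control-rg" -Type AzureEndpoints `
        -TargetResourceId $primaryCd.Id -EndpointStatus Enabled -Priority 1

    New-AzureRmTrafficManagerEndpoint -Name "secondary-cd" -ProfileName "my-dr-tm" `
        -ResourceGroupName "my-dr-control-rg" -Type AzureEndpoints `
        -TargetResourceId $secondaryCd.Id -EndpointStatus Enabled -Priority 2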

Stage 13: Use a Web Apps backup scheduler to create a backup filter list (this excludes some configuration files from the backup), create storage containers for your backup, and trigger the web application backup.

Prepare a backup of the web applications that you want to recover into your secondary data center so that the physical files, configuration files, and binaries are the same as those in your primary data center. Schedule this to occur once a day.

In the backup filter list (_backup.filter), define the configuration files that you do not want overwritten.
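
The following sketch shows one way to author the filter file and schedule a daily backup with AzureRM cmdlets. The excluded paths, storage SAS URL, and names are placeholders; adjust the excluded paths to the configuration files that you do not want overwritten.

    # _backup.filter lists paths (relative to D:\home) to exclude from the backup,
    # one per line. The entry below is an example only.
    $filterLines = @(
        "\site\wwwroot\App_Config\ConnectionStrings.config"
    )
    Set-Content -Path ".\_backup.filter" -Value $filterLines
    # Deploy _backup.filter to D:\home\site\wwwroot of the web application (for example via Kudu or FTP).

    # Schedule a daily backup of the CD web application into a storage container (SAS URL).
    Edit-AzureRmWebAppBackupConfiguration -ResourceGroupName "my-primary-rg" -Name "my-sitecore-cd" `
        -StorageAccountUrl "https://mydrbackupstorage.blob.core.windows.net/cd-backups?<sas-token>" `
        -FrequencyInterval 1 -FrequencyUnit Day -RetentionPeriodInDays 7 -StartTime (Get-Date).AddHours(1)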

Stage 14: Use Azure Automation to run PowerShell at scheduled times (a minimal sketch follows the note below).

  1. Create an automation account, a scheduler for the account, and a runbook script.

  2. Publish the runbook scripts and associate them with a schedule.

  3. Create an automation account service principal.

  4. Import all dependent modules into your storage account.

Note

There are different types of runbooks that you must create:

A Snapshot runbook (used to record settings).

Scheduled to run every hour to fetch the size of the databases and web applications in the primary resource group, and then store it in the Azure key vault. The failover script uses this data for recovery. The service principal is required to give the runbook script specific access to the key vault.

A Synchronize runbook (used to restore to the secondary region).

Scheduled to run every 3 hours to restore web applications to the secondary resource group, and scale resources in the secondary resource group based on the configuration backed up by the Snapshot runbook.

A StateManager runbook (used to monitor performance and initiate failover/failback).

Scheduled to run every 2 hours. Each run monitors the traffic manager continuously for 119 minutes and performs a failover or failback of resources such as the database, caches, and services. For more information, see the Initiating failover and failback section of this topic.
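
The following is a minimal sketch, using AzureRM cmdlets, of creating the automation account, importing and publishing one runbook, and attaching it to a schedule; repeat the runbook and schedule steps for the Snapshot, Synchronize, and StateManager runbooks. The names, paths, and intervals are placeholders, and the sketch omits creating the service principal and importing dependent modules.

    $rgName  = "my-dr-control-rg"
    $account = "my-dr-automation"

    # Automation account in the control resource group.
    New-AzureRmAutomationAccount -ResourceGroupName $rgName -Name $account -Location "West Europe"

    # Import and publish the runbook script.
    Import-AzureRmAutomationRunbook -ResourceGroupName $rgName -AutomationAccountName $account `
        -Name "Snapshot" -Type PowerShell -Path ".\Snapshot.ps1"
    Publish-AzureRmAutomationRunbook -ResourceGroupName $rgName -AutomationAccountName $account -Name "Snapshot"

    # Create an hourly schedule and link the runbook to it.
    New-AzureRmAutomationSchedule -ResourceGroupName $rgName -AutomationAccountName $account `
        -Name "EveryHour" -StartTime (Get-Date).AddMinutes(10) -HourInterval 1
    Register-AzureRmAutomationScheduledRunbook -ResourceGroupName $rgName -AutomationAccountName $account `
        -RunbookName "Snapshot" -ScheduleName "EveryHour"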

If you are using the HADR hot-hot solution, Sitecore recommends the following:

  • After you have set up your disaster recovery solution, wait 24 hours before initiating failover recovery, as your geo-backup may not yet be available for recovery into the secondary region.

  • Scale up the database in your secondary region before you scale up the database in your primary region, to avoid causing an error. This is because your secondary geo-replicated database cannot be smaller than your primary database.

  • Refer to the following table of paired regions with Azure Search for your disaster recovery strategy.

    Geography        Paired region            Paired region              Details
    Asia             East Asia                Southeast Asia             East Asia does not support App Insights.
    Brazil           Brazil South             South Central US           Brazil South does not support App Insights.
    Europe           North Europe (Ireland)   West Europe (Netherlands)  Europe supports App Insights.
    Japan            Japan East               Japan West                 Japan West does not support App Insights.
    North America    East US                  West US                    Japan West does not support App Insights.
    North America    North Central US         South Central US           North Central US does not support App Insights.
    North America    West US 2                West Central US            West Central US does not support App Insights.

The traffic manager must perform health checks on the primary CD web application and the secondary CD web application every 10 seconds. If the primary endpoint fails to return response code 200 three times in a row, then the traffic manager performs a failover to the secondary endpoint.

The StateManager runbook job in the automation account of the control resource group constantly monitors the traffic manager primary endpoint, to ensure that resources such as the database, caches, and the services that serve the CD website also fail over. Whenever the traffic manager initiates a failover/failback process, the StateManager job picks it up and executes the relevant steps. For more information, see Stage 5 under Failover below.

StateManager runbook jobs are scheduled to run every 2 hours. Each run performs constant monitoring of the traffic manager continuously for 119 minutes.

Failover

Stage 1: Stop the automation account.

Stops the Snapshot and Synchronize runbook jobs in the automation account. These runbooks back up the tier configuration and the web applications in the primary resource group and restore them to the secondary resource group.

The web application data restoration only runs in one direction (from the primary resource group to the secondary resource group), so it no longer needs to run after the primary resource group undergoes a disaster.

Stage 2: Fail over the database geo-replication.

Performs a failover on the database failover group.

Stage 3: Update the shard databases.

Updates the shard map in the shard tables and the shard management shard table to match the geo-replicated database behavior.

Stage 4: Perform index rebuilding.

Rebuilds the index in the secondary region. Initiates index rebuilding of items based on data in the SQL databases (this does not apply to the XM topology).

Stage 5: Stop the Content Management (CM) web application (content freezing) and the Index Worker.

The CM web application is stopped for content freezing and the Index Worker web job in the xc-search web application is also stopped to avoid indexing.

Failback

Stage 1: Fail back the database geo-replication.

Performs a failback on the database failover group.

Stage 2: Update the shard databases.

Updates the shard map in the shard tables and the shard management shard table to match the geo-replicated database behavior.

Stage 3: Perform index rebuilding.

Rebuilds the index in the primary region. Initiates index rebuilding of items based on data in the SQL databases (this does not apply to the XM topology).

Stage 4: Restart the CM web application and Index Worker.

Turns on the CM web application and the Index Worker web job that were stopped previously during failover.

Stage 5: Restart the automation account.

Resumes the runbook jobs in the automation account that were stopped previously during failover.