High availability

High availability (HA) is the ability of a system or system component to remain continuously operational for a desirably long period of time. In Managed Cloud, this is achieved by introducing redundancy, and the high availability of the underlying infrastructure is guaranteed by Azure. Within the Managed Cloud containers configuration, we target 99.99% availability for our infrastructure components. High availability is provided within a single location only and does not cover scenarios where an entire region fails. Region-wide failures must be addressed by the Disaster Recovery scenario.

High availability is implemented with the support of the service vendors' built-in features.

Azure Kubernetes Service (AKS) Uptime SLA

AKS does not provide an SLA with the default configuration, because that configuration is free, but Microsoft endeavours to provide 99.5% uptime for it.

The Uptime SLA guarantees 99.95% availability of the Kubernetes API server endpoint for clusters that use Availability Zones and 99.9% availability for clusters that don't. AKS uses master node replicas across update and fault domains to ensure that the SLA requirements are met.
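To make these percentages concrete, the following minimal sketch converts an availability figure into the downtime it permits. The function name and the 30-day month are assumptions chosen for illustration; the billing period defined in the actual Azure SLA terms may differ.

```python
# Illustrative arithmetic only: converts an availability percentage into
# the downtime it allows over a period. The 30-day month is an assumption.
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes


def allowed_downtime_minutes(sla_percent: float,
                             period_minutes: int = MINUTES_PER_30_DAY_MONTH) -> float:
    """Return the maximum downtime (in minutes) that still meets the SLA."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return period_minutes * (1 - sla_percent / 100)


if __name__ == "__main__":
    # Default configuration, Uptime SLA without zones, Uptime SLA with zones.
    for sla in (99.5, 99.9, 99.95):
        minutes = allowed_downtime_minutes(sla)
        print(f"{sla}% -> {minutes:.1f} minutes per 30-day month")
```

Under this assumption, 99.95% allows about 21.6 minutes of downtime per month, 99.9% about 43.2 minutes, and 99.5% about 216 minutes.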

For more information, see Azure Kubernetes Service (AKS) with Uptime SLA.

AKS Workload Availability Zones

When availability zones are used, the Managed Cloud deployment model ensures that nodes in one availability zone are physically separated from nodes in another availability zone. AKS clusters deployed across multiple availability zones provide a higher level of availability and protect against hardware failures and planned maintenance events.

For more details, see Use availability zones in Azure Kubernetes Service (AKS).

Known limitations:

  • You can only define availability zones when the cluster or node pool is created.

  • Availability zone settings can't be updated after the cluster is created. You also can't update an existing, non-availability zone cluster to use availability zones.

  • The chosen node size (VM SKU) must be available across all selected availability zones.

  • Clusters with availability zones enabled require the use of Azure Standard Load Balancers for distribution across zones. This load balancer type can only be defined at cluster creation time. For more information and the limitations of the standard load balancer, see Azure load balancer standard SKU limitations.

  • Additional charges may apply for data transfer. See the Azure Bandwidth pricing page.

Note that data transfer between availability zones (both egress and ingress) is billable.
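Because zone settings cannot be changed after creation, it can help to check a planned node pool against these limitations before provisioning. The sketch below is a hypothetical pre-flight check, not part of any Azure tooling: the `NodePoolPlan` class, the `preflight_issues` function, and the `sku_zones` lookup are all assumed names, and in practice the per-SKU zone data would have to come from Azure for the target region.

```python
from dataclasses import dataclass


@dataclass
class NodePoolPlan:
    """Hypothetical description of a node pool that is about to be created."""
    vm_sku: str
    zones: list[str]
    load_balancer_sku: str = "standard"


def preflight_issues(plan: NodePoolPlan,
                     sku_zones: dict[str, set[str]]) -> list[str]:
    """Return the limitations from the list above that the plan violates.

    sku_zones maps a VM SKU to the zones where it is offered. It is
    illustrative input and must be supplied by the caller.
    """
    issues = []

    # The chosen VM SKU must be available in every selected zone.
    available = sku_zones.get(plan.vm_sku, set())
    missing = sorted(set(plan.zones) - available)
    if missing:
        issues.append(
            f"{plan.vm_sku} is not available in zone(s): {', '.join(missing)}"
        )

    # Zone-enabled clusters require the Standard load balancer SKU.
    if plan.zones and plan.load_balancer_sku.lower() != "standard":
        issues.append("availability zones require the Standard load balancer SKU")

    return issues


if __name__ == "__main__":
    # Example data only; the SKU-to-zone mapping is made up for illustration.
    example_zones = {"Standard_D4s_v3": {"1", "2", "3"}}
    plan = NodePoolPlan(vm_sku="Standard_D4s_v3", zones=["1", "2"])
    print(preflight_issues(plan, example_zones))
```

An empty result means none of the checked limitations are violated. It does not confirm that the deployment will succeed.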

SQL Elastic Pool

Azure SQL Database is a fully managed relational database with built-in regional high availability.

Azure SQL Managed Instance has an availability guarantee of at least 99.99%. This applies to both the Business Critical and General Purpose tiers. There are three service tiers:

  1. General Purpose/Standard: for common workloads

  2. Business Critical/Premium: for high-throughput OLTP applications requiring low latency and high resilience

  3. Hyperscale: for very large OLTP systems, with auto-scaling of storage and compute

The availability guarantees depend on the tier and its configuration:

    1. Azure SQL Database Business Critical or Premium tiers configured as Zone Redundant Deployments have an availability guarantee of at least 99.995%.

    2. Azure SQL Database Business Critical or Premium tiers not configured for Zone Redundant Deployments have an availability guarantee of at least 99.99%.

    3. Azure SQL Database General Purpose, Standard, Basic tiers, or Hyperscale tier with two or more replicas have an availability guarantee of at least 99.99%.

    4. Azure SQL Database Hyperscale tier has an availability guarantee of at least 99.95% with one replica and at least 99.9% with zero replicas.

    5. Azure SQL Database Business Critical tier configured with geo-replication guarantees a recovery point objective (RPO) of 5 seconds for 100% of deployed hours.

    6. Azure SQL Database Business Critical tier configured with geo-replication guarantees a recovery time objective (RTO) of 30 seconds for 100% of deployed hours.
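The availability rules above can be summarized as a small lookup. This is only a sketch that restates the figures listed in this section. The function name and its parameters are assumptions made for illustration, and the authoritative values are those in the current Azure SQL Database SLA.

```python
def sql_database_sla(tier: str, zone_redundant: bool = False,
                     replicas: int = 0) -> float:
    """Return the availability guarantee (in percent) listed in this section.

    tier is matched case-insensitively; replicas only matters for Hyperscale.
    """
    if replicas < 0:
        raise ValueError("replicas must not be negative")

    name = tier.strip().lower()

    if name in {"business critical", "premium"}:
        # Zone-redundant deployments get the higher guarantee.
        return 99.995 if zone_redundant else 99.99

    if name in {"general purpose", "standard", "basic"}:
        return 99.99

    if name == "hyperscale":
        if replicas >= 2:
            return 99.99
        return 99.95 if replicas == 1 else 99.9

    raise ValueError(f"unknown tier: {tier!r}")


if __name__ == "__main__":
    print(sql_database_sla("General Purpose"))
    print(sql_database_sla("Business Critical", zone_redundant=True))
    print(sql_database_sla("Hyperscale", replicas=1))
```

For the General Purpose tier chosen later in this document, the lookup returns 99.99.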

Search Stax

High availability is built in. The uptime guarantee depends on the tier and ranges from 99.5% (Gold) to 99.95% (Platinum Plus).

For details, see Managed Solr Pricing and Features.

Front Door

Azure guarantees that at least 99.99% of the time Azure Front Door Service will respond to client requests and deliver the requested content without error.

Azure Container Registry (ACR)

Azure guarantees that at least 99.9% of the time, Managed Registry will successfully process Registry Transactions. The SLA for Classic Registry is provided through Azure Storage.

If you use a public image, consider importing it into your own container registry, which aligns with your SLO. Otherwise, the image might be subject to unexpected availability issues, which can cause operational problems if the image isn't available when you need it.

See the SLA for Container Registry from Azure.

Decisions

The high availability decisions below apply to the Production configuration.

Table 1. High availability solutions and potential target SLA

| Resource | Solution | Potential target SLA | Comments |
|---|---|---|---|
| AKS | Enable the Uptime SLA feature by default for production deployments | 99.95% when paired with availability zones enabled for the workload | Review the list of supported locations. |
| Windows Node Pool | Configure 2 availability zones; remove the second node pool and use a single node pool; configure the scale set with at least 2 nodes | 99.99% with configured availability zones | Review the list of public locations where availability zones are supported. |
| Linux Node Pool | Configure 2 availability zones; configure the scale set with at least 2 nodes | 99.99% with configured availability zones | Review the list of public locations where availability zones are supported. |
| SQL Elastic Pool | General Purpose tier | 99.99% | |
| Search Stax | Platinum tier | 99.9% | |
| Front Door | High availability provided by default | 99.99% | |
| ACR | | 99.9% | Pull all images (Sitecore and third-party) locally at provisioning time |
| Storage Account | LRS replication | 99.999999999% (11 nines) durability | |
| Kubernetes Workload | Configure roles that can be scaled horizontally with at least 2 pods | | |
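The per-resource targets in Table 1 can be combined into a rough end-to-end estimate. The sketch below multiplies the individual availabilities, which assumes that the components are serial dependencies of a request and fail independently. Both assumptions, the function name, and the choice of which resources sit on the request path are illustrative and do not come from the original decisions.

```python
from math import prod


def composite_availability(slas_percent: list[float]) -> float:
    """Combine the availabilities of serially dependent components.

    Assumes independent failures, so the combined availability is the
    product of the individual availabilities.
    """
    return 100 * prod(sla / 100 for sla in slas_percent)


if __name__ == "__main__":
    # Target SLAs taken from Table 1. Treating these four resources as the
    # serial request path is an assumption made for this example.
    request_path = {
        "AKS": 99.95,
        "SQL Elastic Pool": 99.99,
        "Search Stax": 99.9,
        "Front Door": 99.99,
    }
    combined = composite_availability(list(request_path.values()))
    print(f"Estimated composite availability: {combined:.2f}%")
```

With these four values, the estimate is roughly 99.83%, which is lower than any single component's target. This is why each resource in the table needs its own redundancy measure.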