Monitoring Managed Cloud Containers

You can monitor the performance and availability of your Managed Cloud Containers solution by using the built-in monitoring services:

  • Metrics exporters - libraries that help to export metrics from services and infrastructure to an existing Prometheus server.
  • Prometheus - scrapes metrics from services, aggregates and stores data, and allows other services such as Grafana to collect such metrics.
  • Grafana - collects metrics from Prometheus and visualizes them.
Monitoring services in Managed Cloud Containers overview.
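To make the exporter role more concrete, the following is a minimal sketch of the plain-text exposition format that Prometheus scrapes. It uses only the Python standard library. The `Gauge` type, the `render_exposition` function, and the metric name `node_memory_used_percent` are hypothetical names chosen for illustration and are not part of the Managed Cloud Containers tooling. The sketch assumes one sample per metric name.

```python
from dataclasses import dataclass, field


@dataclass
class Gauge:
    """A single gauge sample with optional labels (hypothetical helper type)."""
    name: str
    help_text: str
    value: float
    labels: dict = field(default_factory=dict)


def render_exposition(gauges):
    """Render gauges as text in the Prometheus exposition format.

    Assumes each metric name appears only once in `gauges`.
    """
    lines = []
    for g in gauges:
        lines.append(f"# HELP {g.name} {g.help_text}")
        lines.append(f"# TYPE {g.name} gauge")
        if g.labels:
            # Labels are rendered as key="value" pairs in sorted order.
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(g.labels.items()))
            lines.append(f"{g.name}{{{label_str}}} {g.value}")
        else:
            lines.append(f"{g.name} {g.value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = Gauge(
        name="node_memory_used_percent",  # illustrative metric name
        help_text="Memory used on the node, in percent.",
        value=72.5,
        labels={"node": "node-1"},
    )
    print(render_exposition([sample]), end="")
```

In practice an exporter serves text like this over HTTP, and Prometheus collects it on a schedule. Real exporters typically rely on a client library rather than formatting the text by hand.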

Authentication in Grafana

Grafana is integrated with Azure Active Directory, and Basic Authentication is disabled. Therefore, you must choose the Sign in with Microsoft authentication option and use your Microsoft work account.

Login to Grafana with Microsoft account

Dashboards

You can search for dashboards by name and filter the results by one or more tags or by starred status. Dashboard search is available through the dashboard picker in the top navigation area of a dashboard. You can also open dashboard search with the keyboard shortcut F.

Dashboard search in Grafana
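The same search criteria (name, tags, starred status) map onto the query parameters of the Grafana HTTP search endpoint, `/api/search`. The sketch below only builds the request URL and does not send it. The host name `grafana.example.com` and the function name `build_search_url` are hypothetical. Whether the HTTP API is reachable in your Managed Cloud Containers environment, and how a request to it would be authenticated, are assumptions that this article does not confirm.

```python
from urllib.parse import urlencode


def build_search_url(base_url, query=None, tags=(), starred=False):
    """Build a Grafana /api/search URL from dashboard search criteria.

    base_url is the Grafana root URL (hypothetical in this sketch).
    """
    params = []
    if query:
        params.append(("query", query))
    # The tag parameter is repeated once per tag.
    for tag in tags:
        params.append(("tag", tag))
    if starred:
        params.append(("starred", "true"))
    url = base_url.rstrip("/") + "/api/search"
    return url + ("?" + urlencode(params) if params else "")


if __name__ == "__main__":
    print(build_search_url(
        "https://grafana.example.com",  # hypothetical host
        query="Node",
        tags=["linux", "kubernetes"],
        starred=True,
    ))
```

For day-to-day use, the dashboard picker described above remains the simplest way to find a dashboard.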

The following default dashboards are available:

Dashboard | Description
Container overview | Lists all containers with their namespace and pod. Provides the status of each container and the total number of healthy/unhealthy and/or stopped containers.
Host Disk Overview (Linux only) | Exposes node filesystem and disk I/O metrics such as read-write time spent, the filesystem available space, and so on.
Host Disk Overview (Windows only) | States the filesystem available space.
Ingress Overview | Provides the Ingress metrics for each Sitecore role and Grafana.
Kubernetes Cluster | Provides a high-level overview of the Kubernetes cluster.
Kubernetes Pod Overview | Exposes memory and CPU request, limits, and utilization per pod for all namespaces including system. It provides live logs.
Linux Node Overview | Provides detailed information about Memory/CPU/Disk utilization for each Linux Node.
MsSql Elastic Pool | Provides detailed information about MsSql Elastic Pool utilization.
Redis Server Overview | Exposes general Redis metrics. Similar to native Redis "INFO" command.
Windows Node Overview | Provides detailed information about Memory/CPU/Disk utilization for each Windows node.

Alerts

Alerts proactively notify the MCP team when issues are found with your Managed Cloud solution. They allow you to identify and address issues before the users of your system notice them.

The following table lists the available alerts:

Category | Description | Condition | Resource | Period
Node statistic | Memory percentage is >85% | The node memory utilization percentage is more than 85%. | Kubernetes node | 10 minutes
Node statistic | CPU percentage of Linux node is >85% | The CPU load percentage of Linux node is more than 85%. | Kubernetes node | 10 minutes
Node statistic | CPU percentage of Windows node is >85% | The CPU load percentage of Windows node is more than 85%. | | 10 minutes
Infrastructure | Pod is not ready for 30m | Pod status != ready | Kubernetes pod | 30 minutes
Infrastructure | Kubelet is down | The kubelet job is down for the last 15 minutes. | Kubernetes job | 15 minutes
Infrastructure | Pod is restarting frequently | The Pod is restarted at least once per 5 minutes. | Kubernetes pod | 1 hour
Infrastructure | Deployment generation mismatch | The deployment has failed but has not been rolled back. | Kubernetes deployment | 15 minutes
Infrastructure | Deployment replicas mismatch | Deployment has not matched the expected number of replicas for longer than an hour. | Kubernetes deployment | 1 hour
Infrastructure | DaemonSet pods not ready | Not all of the desired pods are scheduled and ready. | Kubernetes daemonset | 15 minutes
Infrastructure | DaemonSet pods not scheduled | Not all of the desired pods are scheduled. | Kubernetes daemonset | 10 minutes
Infrastructure | DaemonSet pods misscheduled | Pods of DaemonSet are running where they are not supposed to run. | Kubernetes daemonset | 1 hour
Infrastructure | Warning events occurred | One or more events of type Warning occurred in namespace. | Kubernetes namespace | 1 hour
Infrastructure | Node is not ready | The node is not ready. | Kubernetes node | 1 hour
Infrastructure | Kubernetes version mismatch | There are different semantic versions of Kubernetes components running. | Kubernetes | 1 hour
Infrastructure | Kubernetes API server client is experiencing errors | More than one error in the Kubernetes API server. | Kubernetes | 5 minutes
Infrastructure | Node is running out of pods capacity | The node pods capacity is more than 95%. | Kubernetes node | 15 minutes
Infrastructure | Disk space is used for >90% | The node disk space is used for more than 90%. | Kubernetes node | 1 hour
Infrastructure | Linux node pool reboot required | Linux node reboot required | Kubernetes node |
Prometheus | Prometheus PersistentVolume available space | Prometheus PersistentVolume space is used for > 90% | Prometheus | 1 hour
Sitecore roles | Http request is 5xx >10 | 5xx http response is more than 10. | nginx_ingress_controller | 10 minutes
Sitecore roles | Average page response time >5 – set by default instead of 1 second | The average response time is more than 1 second. | nginx_ingress_controller | 30 minutes
Sitecore roles | Average page response time >30 seconds | The average response time is more than 30 seconds. | nginx_ingress_controller | 5 minutes
Sitecore roles | Availability tests are on /sitecore/service/keepalive.aspx | The availability tests on /sitecore/service/keepalive.aspx failed. | Sitecore pod | 3 minutes
Redis cache | Average number of connected clients in % are >80% | The number of connected clients is more than 80% compared to redis_config_maxclients. | Redis Cache | 30 minutes
Redis cache | The server load is >95% | The processor load percentage for Redis is more than 95% over the last 30 minutes. | Redis Cache | 30 minutes
MSSQL elastic pool | Database throughput unit (vCores) is >95% | More than 95% during last 5 mins. | MSSQL Elastic Pool | 5 minutes
MSSQL elastic pool | Storage percentage is >75% | More than 75% for the last 5 min. | MSSQL Elastic Pool | 5 minutes
MSSQL elastic pool | CPU is >90% | CPU usage is more than 90% for the last 15 mins | MSSQL Elastic Pool | 15 minutes
MSSQL elastic pool | SQL Databases Deadlock | The database is deadlocked. | MSSQL Elastic Pool |
MSSQL elastic pool | Data IO percentage is >90% | More than 90% of Data IO load during the last 15 min | MSSQL Elastic Pool | 15 minutes
MSSQL elastic pool | Log IO percentage is >90% | More than 90% of Log IO load during the last 15 mins | MSSQL Elastic Pool | 15 minutes
MSSQL elastic pool | Workers percentage is >90% | More than 90% Worker load during the last 15 mins | MSSQL Elastic Pool | 15 minutes
MSSQL elastic pool | Concurrent sessions supported by the DB tier is >90% | Number of allowed concurrent sessions has reached 90% of its limit during the last 15 mins | MSSQL Elastic Pool | 15 minutes
MSSQL elastic pool | Number of failed database connections >5 | More than 5 failure db connections over the last 5 mins | MSSQL Elastic Pool | 5 minutes
MSSQL elastic pool | Average In-Memory OLTP storage >95% | More than 95% of Average In-Memory OLTP storage usage over 30 mins. | MSSQL Elastic Pool | 30 minutes
ElasticSearch | ElasticSearch cluster is in yellow state | ElasticSearch cluster is in yellow state | ElasticSearch | 15 minutes
ElasticSearch | ElasticSearch cluster is in red state | ElasticSearch cluster is in red state | ElasticSearch | 15 minutes
ElasticSearch | ElasticSearch cluster JVM is overloaded | ElasticSearch cluster JVM is more than 75% capacity | ElasticSearch | 5 minutes
ElasticSearch | Elasticsearch disk space low | The disk usage is over 80% | ElasticSearch | 0 minutes
ElasticSearch | Elasticsearch disk out of space | The disk usage is over 90% | ElasticSearch | 0 minutes
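
Most alerts in the table pair a threshold with a period, for example "more than 85%" sustained over 10 minutes. One way to read that pairing is sketched below: the alert is treated as breached only when every sample inside the period exceeds the threshold. This is an illustrative interpretation and is not the platform's actual rule evaluation. The function name `breached_for_period` and the sample layout (timestamp in seconds, value) are assumptions.

```python
def breached_for_period(samples, threshold, period_seconds, now):
    """Return True if every sample within the period exceeds the threshold.

    samples: iterable of (timestamp_seconds, value) pairs (assumed layout).
    An empty window never counts as a breach.
    """
    window = [
        value
        for ts, value in samples
        if now - period_seconds <= ts <= now
    ]
    return bool(window) and all(value > threshold for value in window)


if __name__ == "__main__":
    # One memory sample per minute over 10 minutes, all at 90%.
    sustained = [(minute * 60, 90.0) for minute in range(11)]
    # Same series, but with one sample dipping to 80%.
    dipped = [(ts, 80.0 if ts == 300 else v) for ts, v in sustained]

    print(breached_for_period(sustained, 85.0, 600, now=600))
    print(breached_for_period(dipped, 85.0, 600, now=600))
```

Under this reading, a short spike above the threshold does not raise an alert, and a single dip below it within the period resets the condition.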