Kubernetes is rapidly becoming the most popular container orchestrator in modern IT environments. However, Kubernetes deployments can be massive, extending to several thousand containers, and that scale introduces considerable complexity and operational challenges. The first step toward managing Kubernetes effectively is to detect errors proactively and respond to issues quickly. A set of monitoring metrics is required to keep track of container health, and it is important to understand those metrics and gain visibility into the nodes, containers, pods, and overall performance of the cluster.
Monitoring Kubernetes has several benefits beyond the most obvious one of improved reliability. Monitoring helps uncover the root cause of issues, which is rarely easy to identify in a cloud-native architecture. It also enables informed decisions about hardware configuration and performance tuning. Constant monitoring supports cost management by tracking node utilization, chargebacks, and so on. Kubernetes monitoring can also surface crucial information about potential security breaches and help detect threats.
There are two primary sets of metrics that help monitor Kubernetes:
- Kubernetes cluster and node metrics
- Kubernetes application metrics
Let’s look at each of them in some detail.
Kubernetes cluster and node metrics
It is important to understand the health of the complete cluster and to monitor it continuously. To do this, it is essential to know how many resources the entire cluster uses: the number of applications or services running at any given time; the number of containers, pods, and nodes; and whether the nodes are working properly and at the right capacity. The right monitoring solution should provide pertinent insights and visibility into these metrics.
Kubernetes cluster metrics:
A Kubernetes monitoring solution should provide the following insights for optimal awareness of cluster performance:
- The number of containers, pods, nodes
- Memory and disk usage
- Network bandwidth usage
These are important metrics to track in order to understand capacity usage and adjust utilization plans accordingly. They should be viewed separately at each level: node, pod, and container.
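As a minimal sketch of what "viewing metrics at each level" means in practice, the snippet below rolls hypothetical container-level memory figures up to the pod and node levels. The node, pod, and container names and the numbers are illustrative; in a real setup these values would come from a metrics pipeline such as the Kubernetes Metrics API or cAdvisor.

```python
from collections import defaultdict

# Hypothetical sample data: (node, pod, container) -> memory usage in MiB.
container_memory_mib = {
    ("node-1", "web-1", "app"):      300,
    ("node-1", "web-1", "sidecar"):   50,
    ("node-1", "web-2", "app"):      280,
    ("node-2", "batch-1", "worker"): 900,
}

def usage_per_pod(metrics):
    """Aggregate container usage up to the pod level."""
    totals = defaultdict(int)
    for (node, pod, _container), mib in metrics.items():
        totals[(node, pod)] += mib
    return dict(totals)

def usage_per_node(metrics):
    """Aggregate container usage up to the node level."""
    totals = defaultdict(int)
    for (node, _pod, _container), mib in metrics.items():
        totals[node] += mib
    return dict(totals)

pods = usage_per_pod(container_memory_mib)
nodes = usage_per_node(container_memory_mib)
print(pods[("node-1", "web-1")])  # 350
print(nodes["node-1"])            # 630
```

The same roll-up pattern applies to CPU, disk, and network figures.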
Kubernetes control plane metrics:
Scheduling decisions must be made effectively to ensure optimal cluster performance. The control plane keeps track of various metrics that inform these scheduling decisions, and the right metrics should be monitored to ensure the components of the control plane are running efficiently.
The key metrics here are:
- Kubernetes API server calls
- The controller manager, which tracks metrics such as the number of failed nodes
- The scheduler, which allocates pods to available nodes
- etcd, which stores cluster state and elects a leader among its members
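Control plane components expose their metrics in the Prometheus text exposition format on a `/metrics` endpoint. As a hedged sketch, the snippet below parses a hypothetical sample of API server counters and totals the server-error requests; the metric values shown are invented for illustration.

```python
# Hypothetical sample of API server counters in Prometheus text format.
sample_metrics = """\
apiserver_request_total{verb="GET",code="200"} 1452
apiserver_request_total{verb="POST",code="201"} 97
apiserver_request_total{verb="GET",code="500"} 3
"""

def parse_counters(text):
    """Parse 'name{labels} value' lines into a dict, skipping comments."""
    counters = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name_labels, value = line.rsplit(" ", 1)
        counters[name_labels] = float(value)
    return counters

counters = parse_counters(sample_metrics)
# Total requests that returned a 500 status.
error_requests = sum(v for k, v in counters.items() if 'code="500"' in k)
print(error_requests)  # 3.0
```

In practice a scraper such as Prometheus handles this parsing; the sketch only illustrates the shape of the data.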
Relevant visualization tools such as Grafana dashboards can be used once monitoring metrics are set up. The insights can help diagnose any issues that may occur.
Kubernetes application metrics
Container metrics help determine whether each container is approaching its configured resource limits. Resource allocation is critical to ensuring there is no disruption to application performance: pods should not be under- or over-provisioned, nor stuck in a CrashLoopBackOff. Metrics can be set up to identify and troubleshoot these issues by tracking container CPU usage, container memory utilization, and network usage.
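A simple way to act on "approaching limits" is an alerting threshold. The sketch below flags containers whose CPU usage has crossed a fraction of their configured limit; the 80% threshold, container names, and numbers are illustrative assumptions, not defaults from any tool.

```python
def near_limit(usage, limit, threshold=0.8):
    """Return True when usage has crossed the threshold fraction of the limit.
    The 0.8 default is an illustrative choice, not a Kubernetes default."""
    return usage >= threshold * limit

# Hypothetical container readings: current CPU usage vs. configured limit,
# both in millicores.
containers = [
    {"name": "app",     "cpu_millicores": 450, "cpu_limit": 500},
    {"name": "sidecar", "cpu_millicores": 120, "cpu_limit": 250},
]

flagged = [c["name"] for c in containers
           if near_limit(c["cpu_millicores"], c["cpu_limit"])]
print(flagged)  # ['app']
```

The same check works for memory utilization against memory limits.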
Application metrics measure the availability and performance of applications running inside Kubernetes through Request rate, Error rate, and Duration (the RED metrics). In addition to the RED metrics, runtime metrics such as memory, heap usage, and thread counts, for example from the Java Virtual Machine (JVM), should also be collected.
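To make the RED metrics concrete, the sketch below derives all three from a small list of request records. The records and the observation window are hypothetical; a real deployment would aggregate these in a metrics backend such as Prometheus rather than in application code.

```python
# Hypothetical request records observed over a short window.
requests = [
    {"status": 200, "duration_ms": 40},
    {"status": 200, "duration_ms": 55},
    {"status": 500, "duration_ms": 120},
    {"status": 200, "duration_ms": 35},
]
window_seconds = 2  # illustrative observation window

# Request rate: requests per second over the window.
request_rate = len(requests) / window_seconds
# Error rate: fraction of requests that returned a server error.
error_rate = sum(r["status"] >= 500 for r in requests) / len(requests)
# Duration: mean latency (percentiles are more common in practice).
avg_duration_ms = sum(r["duration_ms"] for r in requests) / len(requests)

print(request_rate)     # 2.0
print(error_rate)       # 0.25
print(avg_duration_ms)  # 62.5
```

For Duration, production systems usually track percentiles (p50/p95/p99) rather than a plain average, since averages hide tail latency.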
Application availability metrics measure uptime and response time, which are critical to user experience. Job failures and crash loops make an application unavailable, so monitoring them helps ensure availability.
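One simple way to express an availability figure is the fraction of health-check probes that succeeded over a window, as sketched below with invented probe results.

```python
# Hypothetical health-check outcomes over a monitoring window
# (True = probe succeeded, False = probe failed).
probe_results = [True, True, True, False, True, True, True, True, True, True]

availability = sum(probe_results) / len(probe_results)
print(f"{availability:.0%}")  # 90%
```

Real availability reporting is usually computed over much longer windows and tied to a service-level objective.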
Application health & performance:
These metrics highlight the lack of responsiveness, latency, and all other issues that degrade the user experience in applications.
kube-state-metrics is a Kubernetes service that provides data on the state of cluster objects. Several aspects of cluster state, such as persistent volumes, disk pressure, crash loops, and job failures, can be monitored using kube-state-metrics.
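As an example of working with this data, kube-state-metrics exposes a `kube_pod_container_status_waiting_reason` series whose value is 1 for the current waiting reason of a container. The sketch below scans a hypothetical sample of that output for pods stuck in CrashLoopBackOff; the pod names are invented.

```python
import re

# Hypothetical kube-state-metrics output lines.
sample = """\
kube_pod_container_status_waiting_reason{pod="web-1",reason="CrashLoopBackOff"} 1
kube_pod_container_status_waiting_reason{pod="web-2",reason="ContainerCreating"} 1
kube_pod_container_status_waiting_reason{pod="web-3",reason="CrashLoopBackOff"} 1
"""

def crash_looping_pods(text):
    """Return pod names whose waiting reason is currently CrashLoopBackOff."""
    pods = []
    for line in text.splitlines():
        if 'reason="CrashLoopBackOff"' in line and line.rstrip().endswith(" 1"):
            match = re.search(r'pod="([^"]+)"', line)
            if match:
                pods.append(match.group(1))
    return pods

looping = crash_looping_pods(sample)
print(looping)  # ['web-1', 'web-3']
```

In practice this check is written as a Prometheus alerting rule rather than ad-hoc parsing; the sketch only shows what the underlying data looks like.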
Monitoring persistent volumes (PVs) provides visibility into whether storage is appropriately utilized and reclaimed when it is no longer required by the workloads that claim it. Disk pressure is a configurable threshold that helps monitor the volume of disk space used and the rate at which it is consumed.
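A disk-pressure check reduces to comparing used space against a configurable fraction of capacity, as in the sketch below. The 85% threshold and the byte figures are illustrative choices, not kubelet defaults.

```python
def under_disk_pressure(used_bytes, capacity_bytes, threshold=0.85):
    """Return True when disk usage has crossed the configured threshold.
    The 0.85 default here is an illustrative assumption."""
    return used_bytes / capacity_bytes >= threshold

print(under_disk_pressure(90 * 10**9, 100 * 10**9))  # True
print(under_disk_pressure(40 * 10**9, 100 * 10**9))  # False
```

Tracking the rate of change of usage alongside this check gives early warning before the threshold is hit.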
Cloud vendor Kubernetes monitoring tools
Cloud providers such as AWS, Azure, and Google Cloud offer several solutions for monitoring Kubernetes in the cloud. Amazon CloudWatch Container Insights helps detect, troubleshoot, and monitor workloads in container and microservice environments.
Container insights is a feature of Azure Monitor that tracks the performance of containers running in the Azure cloud. Google Kubernetes Engine provides similar services on Google Cloud.
There are numerous cloud monitoring services to choose from. It is recommended to start with services such as the Kubernetes Dashboard and then graduate to specialized monitoring tools based on specific needs. Some of the top Kubernetes monitoring tools are cAdvisor (Container Advisor), Prometheus, Datadog, Kubewatch, Sematext, and Jaeger.
The metrics discussed above can help with continuous monitoring of the Kubernetes system. Teams responsible for maintenance of Kubernetes should gain access to the right metrics to reliably monitor its operations.
If you have questions related to this topic, feel free to book a meeting with one of our solutions experts or email firstname.lastname@example.org.