Kubernetes is the most popular container orchestrator for running cloud-native applications. It acts as a bridge between traditional Virtual Machines and serverless technology. Developers learn to build, package, and actually operate their applications. Security experts help to secure them. These disciplines form the heart of a DevSecOps team. Despite all of the good efforts, things can go wrong. Debugging a Kubernetes-based workload can be tricky since there are so many moving parts. Besides this, there are multiple layers: the raw infrastructure layer (e.g. worker nodes and networking), the platform layer (e.g. Rancher or EKS), and the application layer itself (e.g. the Pods, DaemonSets, or Services). When problems occur, you need to analyze your workloads. This article offers useful tips and tools to debug your Kubernetes workloads.
Things that can go wrong
Before we can answer the main question “how to debug Kubernetes workloads”, it’s good to understand what can go wrong in a Kubernetes cluster. The short answer is: a lot. When digging a bit deeper into the system, there are multiple areas that can cause problems. Here are just a few to give you an impression:
- Networking issues (even if the container network is maintained by a cloud provider), such as incorrectly configured Ingress or egress controllers, or wrongly configured Network Policies (too strict, so you can’t access your application(s), or too loose, so you see far too many traffic logs). Besides, your networking layer (which also consists of container-based components) can contain bugs and/or vulnerabilities. These need to be fixed or patched, but only after you have debugged the current situation.
- Probes. Both readiness and liveness probes need to work properly: the readiness probe determines if and when your application (Pod) is ready to receive traffic, and the liveness probe determines when a container needs to be restarted.
- Resource limits. If they are not set to appropriate values, they affect the stability of your Pods as well as the applications they host. On top of that, they also affect other Pods (think of service interruptions or problems when recovering from severe problems). Even your worker nodes can go out of service. High CPU or memory usage can prevent your entire cluster from functioning correctly. Unexpected behavior can be the cause and random failures can be the result.
- Permission problems with Role-Based Access Control (RBAC). If RBAC is enabled, you sometimes need to correct your permissions. A Service Account might have too few permissions to do its job, causing your application to fail.
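A first look at the problem areas listed above could be sketched as follows. This is a minimal sketch, not a complete checklist: the function names are made up for illustration, and it assumes Kubectl is already authenticated against the cluster.

```shell
# Hypothetical helper names; only the kubectl invocations are real.

# Recent events in a namespace, oldest first. Failing probes and
# scheduling problems usually show up here.
recent_events() {   # usage: recent_events <namespace>
  kubectl get events --namespace "$1" --sort-by=.lastTimestamp
}

# Network Policies that may be blocking (or allowing) traffic.
list_network_policies() {   # usage: list_network_policies <namespace>
  kubectl get networkpolicies --namespace "$1"
}

# Why did the containers of a Pod last terminate? A value such as
# "OOMKilled" points at a memory limit that is too low.
last_termination_reason() {   # usage: last_termination_reason <pod> <namespace>
  kubectl get pod "$1" --namespace "$2" \
    --output jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
}
```

For example, `recent_events shop` would list the events of a namespace called `shop` (a placeholder name).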
Besides the Kubernetes-based issues, your application itself can be the root cause of all evil. You also need more sophisticated methods to troubleshoot it.
Learn the command-line
First of all, it’s extremely important to focus on the command line since all tools depend on it. You can use a Linux, Windows, or macOS-based developer environment. Whatever you do, you need to fetch information from running services as well as log files, (cloud) APIs, and other sources. After you fetch the information, you need to process it: filter the (massive) amount of results, aggregate them, and turn the data into information. All of these actions call for command-line tools since you need to automate as much as possible to be ready for future incidents. GUI-based tools make this kind of automation difficult, so avoid relying on them.
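The fetch, filter, and aggregate steps described above can be sketched with Kubectl and awk. The function names are hypothetical, and the second function assumes the default column layout of `kubectl get pods --all-namespaces` (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Fetch one line per Pod containing only its phase (Running, Pending, ...),
# then aggregate: count how many Pods are in each phase.
pods_per_phase() {
  kubectl get pods --all-namespaces --no-headers \
    --output custom-columns=PHASE:.status.phase |
    awk '{ count[$1]++ } END { for (p in count) print p, count[p] }' |
    sort
}

# Filter: keep only Pods whose RESTARTS column (5th field) is above zero.
restarting_pods() {
  kubectl get pods --all-namespaces --no-headers |
    awk '$5 > 0 { print $1 "/" $2, "restarts=" $5 }'
}
```

Because both functions only print text, their output can be piped into further tools or stored for comparison during a later incident.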
Kubectl
Kubernetes ships with Kubectl, the command-line tool to operate your cluster. Kubectl is a binary that runs on Windows, macOS, and Linux. When properly authenticated to the cluster, it communicates with the Kubernetes API server to send commands to the cluster itself. In essence, you can control the entire cluster with the commands that are available. The following commands help to troubleshoot problematic workloads that require your intervention.
- Kubectl describe: this command describes the structure and status of a Pod, Service, or Deployment. It’s also possible to gather information from your worker or master nodes to identify problems such as low disk space, high memory consumption, etc. The output is human-readable plain text. If you need structured output, use Kubectl get with the JSON or YAML output format; JSON can then be processed further by tools like jq.
- Kubectl logs: this command prints the logs of the containers in a Pod. By default, containers log to stdout and stderr. With Kubectl logs you can fetch the information from these log streams.
- Kubectl exec: a powerful subcommand to execute a command in a running container. It gives the troubleshooter a possibility to see what’s going on in the container (such as viewing the actual configuration, startup scripts, or file/directory permissions). However, keep in mind that you should NEVER change a running container. This violates the principle of “immutable infrastructure”. Security tools also trigger an alert if they detect runtime changes to containers which do not stem from a proper deployment mechanism.
- Kubectl auth: this is a very powerful subcommand to validate the authorization of a user or group for a certain action. It’s also possible to validate the permissions of a Service Account. For example: validate if the Service Account can actually create or delete Deployments and Services. The basic syntax is in the following style: “kubectl auth can-i …”.
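The four subcommands above could be wrapped as follows. This is a sketch under assumptions: the function names are invented for illustration, and any Pod, namespace, or Service Account names you pass in are your own.

```shell
# Human-readable status and recent events of a Pod.
describe_pod() {   # usage: describe_pod <pod> <namespace>
  kubectl describe pod "$1" --namespace "$2"
}

# The same Pod as structured JSON, suitable for tools such as jq.
pod_json() {   # usage: pod_json <pod> <namespace>
  kubectl get pod "$1" --namespace "$2" --output json
}

# The last 100 log lines of every container in the Pod.
pod_logs() {   # usage: pod_logs <pod> <namespace>
  kubectl logs "$1" --namespace "$2" --all-containers --tail=100
}

# Read-only look inside a running container: print its environment.
pod_env() {   # usage: pod_env <pod> <namespace>
  kubectl exec "$1" --namespace "$2" -- env
}

# May the Service Account perform <verb> on <resource>? Prints "yes" or "no".
can_service_account() {   # usage: can_service_account <namespace> <sa> <verb> <resource>
  kubectl auth can-i "$3" "$4" --namespace "$1" \
    --as "system:serviceaccount:$1:$2"
}
```

For instance, `can_service_account shop deployer create deployments` checks whether a Service Account named `deployer` in the namespace `shop` (both placeholder names) may create Deployments there.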
Running containers
- Debug. Use Kubectl debug to check what is going on in your running container. It’s pretty useful if you cannot access a shell inside the container (when using Kubectl exec), for example because the image does not contain one. With Kubectl debug it’s also possible to create a copy of the Pod you want to inspect. This way, you can safely investigate it without modifying the original resource and thus without impact on the running application.
- Attach. Use Kubectl attach to connect to a running container. It resembles Kubectl exec, but the main difference is that Kubectl attach hooks into the main process of the container, whereas Kubectl exec starts a new process in the container. In essence, Kubectl exec can replace attach, but there are some use cases in which people want to send/stream data to the main process running in the container, so this still justifies the Kubectl attach feature.
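A sketch of both approaches, assuming a cluster and Kubectl version that support ephemeral containers. The function names and the `busybox` debug image are illustrative choices, not requirements.

```shell
# Add an ephemeral debug container to a running Pod. --target lets the
# debug container see the processes of the named application container.
debug_in_place() {   # usage: debug_in_place <pod> <namespace> <container>
  kubectl debug "$1" --namespace "$2" -it --image=busybox --target="$3"
}

# Debug a copy of the Pod instead, leaving the original untouched.
# The copy is named "<pod>-debug".
debug_copy() {   # usage: debug_copy <pod> <namespace>
  kubectl debug "$1" --namespace "$2" -it --image=busybox \
    --copy-to="$1-debug" --share-processes
}

# Attach to the main process of a container to stream data to and from it.
attach_main() {   # usage: attach_main <pod> <namespace> <container>
  kubectl attach "$1" --namespace "$2" --container "$3" -it
}
```

Remember to delete the copied Pod once the investigation is done, since Kubectl does not clean it up for you.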
As you can see, you can already do a lot with Kubectl. It’s not only nice from a developer perspective, but also from a cluster operator perspective. People who maintain Kubernetes clusters frequently use it to troubleshoot issues on a cluster level as well. Use Kubectl top to display resource-based information such as memory or CPU usage of Pods and worker nodes (this requires the Metrics Server to be installed in the cluster). Or use Kubectl cluster-info to show the addresses of the control plane and cluster services; Kubectl cluster-info dump fetches far more detailed state for debugging.
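The cluster-level commands mentioned above might be used like this. The function names are hypothetical, and Kubectl top only returns data when the Metrics Server runs in the cluster.

```shell
# Pods across all namespaces, heaviest memory consumers first.
pods_by_memory() {
  kubectl top pods --all-namespaces --sort-by=memory
}

# CPU and memory usage per worker node.
node_usage() {
  kubectl top nodes
}

# Addresses of the control plane and cluster services.
cluster_endpoints() {
  kubectl cluster-info
}

# Detailed cluster state, written to a directory for offline analysis.
cluster_dump() {   # usage: cluster_dump <directory>
  kubectl cluster-info dump --output-directory="$1"
}
```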
Other debugging methods
So far so good, but what should you do if you do not have access to Kubectl at all? In developer environments, you might have access to your own application resources. Or you have complete control over your Kubernetes environment (in the case of a Platform operator). This might not be the case for Test, Acceptance, and Production environments. What should you do to troubleshoot resources in these environments? There are a number of options which are outlined below. For the sake of simplicity, only tools you can install and operate yourself are mentioned.
Telepresence
Kubernetes is powered by a lot of communities that contribute in one way or another to its development. The same is true for tools like Telepresence. This tool intercepts traffic in your target Kubernetes environment and redirects it to a local counterpart of your service. It helps to debug traffic-related problems and issues with Pods, containers, and other Kubernetes-based resources. Telepresence works with Kubectl and oc (the CLI utility for the OpenShift container platform).
Besides these options, it’s also possible to hook into your local IDE such as Visual Studio Code or IntelliJ. Just as with locally running code, you can add breakpoints in your IDE to see line by line where bugs occur and which information is present when the actual error occurs. Furthermore, you can make changes on the fly and see what happens with your cluster without actually modifying your production workloads.
For a better understanding of interceptors, read the latest documentation on the official website of Telepresence.
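As a rough sketch, an intercept session with the Telepresence 2 command line could look like the following. The function names are invented, and the available subcommands and flags differ between Telepresence versions, so treat this as an assumption to verify against the documentation of your installed version.

```shell
# Connect to the cluster, list interceptable workloads, and redirect the
# traffic of one workload to a port on the local machine.
intercept_start() {   # usage: intercept_start <workload> <local-port>
  telepresence connect
  telepresence list
  telepresence intercept "$1" --port "$2"
}

# Stop the intercept and disconnect from the cluster.
intercept_stop() {   # usage: intercept_stop <workload>
  telepresence leave "$1"
  telepresence quit
}
```

While the intercept is active, a locally running copy of the service listening on the given port receives the traffic, which is what makes IDE breakpoints usable.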
Logs to the rescue
Often you need to explore the logs of your Kubernetes resources. This can be a massive task when there are a lot of Pods to be viewed. Suppose your application contains multiple micro-services that run all in their own container. You need to view the logs from the front-end container, the backend service, the messaging service, the container that captures the API calls, and perhaps a container that is used as an authentication service. If all of these containers also use a sidecar container to keep an eye on what’s going on, there is a lot of work needed to view all of the logs.
Of course, you can write a (simple) script to pull all information together. It would be a better idea to use a tool such as Kubetail to aggregate the log results for Pods, Deployments, or other resources. This helps to speed up and fetch logs in a consistent way. It’s great to see that this tool is available for all major operating systems and that it was updated recently.
Other log aggregators are Fluentd and Fluent Bit, which operate in your cluster itself.
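Both routes can be sketched briefly: the "simple script" using plain Kubectl with a label selector, and the Kubetail equivalent. The function names are hypothetical, the label and Pod names are whatever your workloads use, and the Kubetail flag shown assumes the Bash-based Kubetail script.

```shell
# Plain Kubectl: the last 50 lines of every container in all Pods that
# match a label selector. --prefix marks each line with its Pod and container.
logs_by_label() {   # usage: logs_by_label <label-selector> <namespace>
  kubectl logs --selector "$1" --namespace "$2" \
    --all-containers --prefix --tail=50
}

# Kubetail: follow the logs of all Pods whose name matches the given name.
tail_with_kubetail() {   # usage: tail_with_kubetail <pod-name> <namespace>
  kubetail "$1" -n "$2"
}
```

For example, `logs_by_label app=frontend shop` would collect the logs of all Pods labelled `app=frontend` in a namespace called `shop` (placeholder values).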
Spying on objects
Sometimes, you want to analyze the Kubernetes resources that give problems. With Kubespy you can observe your Kubernetes resources in real time. Spying on those objects reveals valuable information about what happens when a Pod is booted or when a Service allocates an IP address. Kubespy is a tool written in Go that traces the status of Kubernetes resources to find out what happens under the hood in a certain time period. You can run it at any given moment during the lifetime of a resource.
The output is in JSON format so you can easily parse it. Fields that change (frequently) during the lifetime of the resource are highlighted and colored to locate them easily. Good to know: Kubespy originates from the Pulumi project.
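A short sketch of how Kubespy is typically invoked, based on the subcommands described in its README. The function names are invented, and the exact argument format should be checked against the version you install.

```shell
# Watch the status field of a Pod and show how it changes over time.
spy_pod_status() {   # usage: spy_pod_status <namespace> <pod>
  kubespy status v1 Pod "$1/$2"
}

# Trace a Deployment and the resources it rolls out.
spy_deployment() {   # usage: spy_deployment <namespace> <deployment>
  kubespy trace deployment "$1/$2"
}
```

Starting one of these in a second terminal before applying a change makes it easier to see what the cluster does in response.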
There is more
The above-mentioned tools are just a very short list of the numerous tools that help developers and Kubernetes platform operators to debug their clusters. Other tools that recently saw the light are K9s, Knative inspect, Kubeman, and Kubectl debug. Find all of them on the Kubetools GitHub page.
Conclusion
Kubernetes clusters are all based on a large number of components. Various things can go wrong during and after the deployment of applications. Multiple microservices which together form an application should be analyzed when developers or platform operators seek the root cause of a problem. Don’t underestimate the complexity of Kubernetes networking. Tools like Kubectl, Telepresence, and Kubetail offer ways to debug Pods, Deployments, Services, and other Kubernetes resources. Hopefully, this article offered some insights into the tools and tricks you can use to debug and troubleshoot your workloads so you can continue your valuable work faster.