Containers are the de facto standard for running applications nowadays, and Kubernetes is the number one platform to orchestrate them. CI/CD pipelines and cloud-native services add an extra layer of complexity, alongside persistent storage and (container) networking. All of these aspects need to work together to serve your end users their favorite application or website. Many industries have to conform to a lot of rules and regulations in order to stay compliant, and auditors assess those companies to judge whether or not they are “in control”.
In this article, we’ll explore which aspects are relevant to auditing your Kubernetes cluster and how you can make sure you present crystal-clear results to any auditor who visits your organization. It’s crucial to capture a chronological set of records that depicts the sequence of relevant (security) events in a Kubernetes cluster. Every cluster should monitor activities generated by applications, by the Kubernetes API itself (which is at the heart of everything), and by the end users themselves.
When it comes to auditing, cluster administrators need to be able to answer the following questions:
- What exactly happened, when did it happen, and who or what initiated it?
- What is the object that is impacted and where was it observed?
- From where did it come and where was it going?
Essentially, these questions help to craft a clear picture of everything that is relevant for every Kubernetes resource that lives within the cluster. This applies both to the hosted components, such as everything that is part of the control plane, and to the (custom) components that are the responsibility of the end user (platform teams, cluster operators, and DevOps teams).
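Each of these questions maps to a field in a Kubernetes audit event. A shortened, hypothetical Metadata-level event (all names and values are made up for illustration) could look like this:

```json
{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "Metadata",
  "auditID": "6b8d9c2e-0f4a-4c3b-9e1d-2a7f8b5c1d3e",
  "stage": "ResponseComplete",
  "verb": "delete",
  "user": { "username": "jane@example.com", "groups": ["dev-team"] },
  "sourceIPs": ["10.0.0.12"],
  "objectRef": { "resource": "pods", "namespace": "production", "name": "payments-api" },
  "requestReceivedTimestamp": "2023-01-01T12:00:00.000000Z",
  "responseStatus": { "code": 200 }
}
```

The `verb`, `user`, and timestamp fields answer “what happened, when, and who initiated it”; `objectRef` answers “which object was impacted and where”; `sourceIPs` answers “from where did it come”.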
Whatever solution you choose (more on that later), it’s crucial to understand why you might require auditing. Kubernetes is a part of your application infrastructure that handles secrets (to connect to other internal or external services), processes important customer data, and hosts mission-critical workloads. You need to protect these aspects as well as you can.
Hardening a Kubernetes cluster is vital in a (cloud native) enterprise-grade environment. The Center for Internet Security (CIS) offers a Kubernetes benchmark to check the current security status of your Kubernetes cluster. Microsoft offers a great list that shows all of the aspects which are evaluated. These apply to AKS, but you can also use that as a reference for other (managed) clusters.
More than security
Besides security-related aspects, auditing also helps in other areas. Think of tracing slow API requests that require investigation, or analyzing authentication issues to trace unexpected activities.
Built-in features of Kubernetes
Kubernetes offers (basic) native support for auditing events.
It all begins in the kube-apiserver, which records events in one of the following four stages:
- RequestReceived: an audit handler receives the request before it’s delegated to the remaining chain.
- ResponseStarted: the phase that exists before the response body of a long-running request is sent but right after the header is sent.
- ResponseComplete: follows up on the previous event, when the entire response body is completed.
- Panic: depicts events that are generated when a panic occurs.
Every request in every stage generates an audit event that is processed according to a Kubernetes audit policy. This is a built-in feature of Kubernetes itself, so you can use it without any additional tools. Based on the policy, the event is written to a so-called backend: either a log file or a webhook.
The policy controls the audit levels of the events. Audit levels are set through rules which are evaluated from top to bottom; the first matching rule applies. You can define the following levels: None (never log events that match the rule), Metadata (only log metadata such as resource, verb, and timestamp), Request (log metadata plus the request body), and RequestResponse (log metadata, the request body, and the response body).
Policy files are written in YAML, just like other Kubernetes manifests, and use the kind Policy.
Two examples to make it more practical:
```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
omitStages:
  - "RequestReceived"
rules:
  # Log changes of Services at the RequestResponse level
  - level: RequestResponse
    resources:
      - group: ""
        resources: ["services"]
  # Log Secrets at the Metadata level in all namespaces
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets"]
```
Auditing is not enabled by default, so you need to turn it on by changing the Pod definition of the kube-apiserver. In this YAML file, you need to refer to the persistent storage location where your log files are saved, and you need to reference your policy file (for example /etc/kubernetes/auditpolicies/policy.yaml).
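As a sketch of what that looks like (paths and retention values are example choices, not defaults), the relevant parts of the kube-apiserver Pod definition could be:

```yaml
spec:
  containers:
    - name: kube-apiserver
      command:
        - kube-apiserver
        # ... other existing flags ...
        - --audit-policy-file=/etc/kubernetes/auditpolicies/policy.yaml
        - --audit-log-path=/var/log/kubernetes/audit/audit.log
        - --audit-log-maxage=30     # days to retain old log files
        - --audit-log-maxbackup=10  # number of rotated files to keep
        - --audit-log-maxsize=100   # max size in MB before rotation
      volumeMounts:
        - mountPath: /etc/kubernetes/auditpolicies/policy.yaml
          name: audit-policy
          readOnly: true
        - mountPath: /var/log/kubernetes/audit
          name: audit-log
  volumes:
    - name: audit-policy
      hostPath:
        path: /etc/kubernetes/auditpolicies/policy.yaml
        type: File
    - name: audit-log
      hostPath:
        path: /var/log/kubernetes/audit
        type: DirectoryOrCreate
```

After saving the manifest, the kubelet restarts the kube-apiserver automatically and audit events start flowing into the log file.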
Google also offers a configuration helper option to assist you when enabling and configuring auditing in your cluster.
Since many people have already dived into this topic, it’s wise to pick up their best practices to quickly get up to speed. Among them, the following tips help:
- Connect your logs to a visualization dashboard so you get informed about important events (such as the deletion of critical pods, changing secrets, or other important configmaps).
- Restrict access to the log files themselves, and protect them as well as the storage location in which they are saved. No one should be able to tamper with them.
- Instead of storing logs on a (local) filesystem, you can send them to external endpoints. Keep them (far) away from the cluster itself.
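For that external-endpoint approach, Kubernetes offers a webhook backend: you pass --audit-webhook-config-file to the kube-apiserver, pointing at a file in kubeconfig format. The server URL below is a made-up example:

```yaml
apiVersion: v1
kind: Config
clusters:
  - name: audit-collector
    cluster:
      # hypothetical external endpoint that receives the audit events
      server: https://audit.example.com/k8s/events
contexts:
  - name: default
    context:
      cluster: audit-collector
      user: ""
current-context: default
users: []
```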
Don’t try to save time by creating a minimal policy file. It’s much better to capture too many audit events in the beginning than to miss out on crucial ones. There is always room to sharpen your policies and optimize your auditing events later.
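A broad starter policy that errs on the side of logging too much can be as small as this sketch:

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Catch-all: log metadata for every request in every stage
  - level: Metadata
```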
Analyzing audit logs can be time-consuming and difficult, especially when you run Kubernetes “at scale” in large data centers. Everything is dynamic and changes constantly, even on “not so busy” clusters, since Kubernetes itself constantly sends requests between its internal components (most importantly to the kube-apiserver).
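Much of that constant internal traffic can be filtered out in the policy itself. The rules below (adapted from commonly used reference policies; treat them as a sketch to adjust for your cluster) silence some well-known noisy system components before a catch-all rule:

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Don't log watch requests by kube-proxy on endpoints and services
  - level: None
    users: ["system:kube-proxy"]
    verbs: ["watch"]
    resources:
      - group: ""
        resources: ["endpoints", "services"]
  # Don't log node status updates coming from the kubelets
  - level: None
    userGroups: ["system:nodes"]
    verbs: ["update"]
    resources:
      - group: ""
        resources: ["nodes/status"]
  # Log everything else at the Metadata level
  - level: Metadata
```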
It’s not enough to know that all of these events happen “under the hood”. It’s about the contextual information that should trigger a (manual) action from your side. For example, you need to know the following before you should be alarmed:
- Which workloads are critical (for example, databases)?
- Which users and roles have a good reason to access them?
- When did the access happen?
- What else happened during that specific time period (such as the admin user logging in to create a snapshot)?
Another challenge is to correlate events to detect suspicious behavior. Take failed authentication attempts as an example: detecting them one by one is easy, but that alone is not enough to draw a conclusion. More context is needed. A series of failed authentication attempts with username X within a certain period from an unknown host can be an indicator of an attack, especially if that same user has logged in successfully from a validated and trusted host during that period.
Getting your audit logs into a safe location is one thing; making use of them is another. Luckily, there are several tools to help you visualize your logs so you get a quick view of what’s going on in your cluster.
A collection of popular tools in the monitoring and logging landscape:
- Grafana is one of the best-known tools to visualize logs in Kubernetes clusters.
- The ELK stack is a completely open-source stack for log management. Elasticsearch makes it possible to search your logs, while Logstash aggregates them and Kibana visualizes them (just like Grafana).
- Fluentd is a data collector that aggregates, transforms, and pushes logs to various endpoints so they can be analyzed or processed further.
- Prometheus is a very powerful monitoring tool for collecting and querying metrics; plenty of Helm charts exist to deploy it alongside the previously mentioned tools.
A lot of these tools are open source or require a relatively small fee for advanced features. Since many large companies such as Microsoft and AWS also use them, there is a high level of trust in them, and a lot of them are backed by the CNCF.
eBPF: the new hype?
Detecting and reacting to (security) events in the auditing phases is a good thing. Having properly configured auditing rules and tools in place helps to keep your cluster healthy and your data safe. But it would be much better to actually prevent bad things from happening in the first place. Quite recently, eBPF saw the light of day. This new technology works very well in the cloud-native landscape for a number of topics: auditing, performance, and security.
eBPF stands for extended Berkeley Packet Filter. In simple words: it captures network traffic between workloads (on the application level) and the hardware that actually executes this traffic (think of the memory, the network interface, and the CPU). eBPF makes it possible to interact between those two layers in a programmer-friendly way.
Right at the core (kernel) of the Linux system.
A few examples of what you can do with it:
- Evaluate requests from a Kubernetes resource before they reach a higher-level runtime security tool, and drop those requests when they are malicious, so your security tool doesn’t have to get involved at all.
- Find the fastest path to route traffic from your load balancer(s) to your actual workloads. This optimizes applications for high-speed traffic and thus a better user experience.
- Audit events to determine whether they need to be processed further, or logged at all, before they proceed further in the chain between component X and component Y.
Before eBPF, programming those kinds of rules was very difficult, and changes took a lot of time to be adopted by the community since everyone was affected by them. Now these kinds of use cases have become much easier to implement, so a lot of companies have already jumped on this fast-moving train. Among the tools in this space are BumbleBee, Hubble, and Cilium.
It’s best to check out the eBPF website to explore what it can do for you in terms of your auditing requirements.
As seen in this article, there are good reasons to audit your Kubernetes clusters: not only from a security perspective but also to constantly verify that your cluster behaves as it should, to avoid slowness and downtime for your workloads.
Kubernetes offers a built-in feature to capture events and log them to external sources. Besides this, there are several other tools to collect, transform, aggregate, store, and visualize useful events. Some of them offer a graphical user interface to quickly spot problematic issues. Besides reacting to events that already happened, it’s even more efficient to actually prevent them. eBPF is quickly becoming more popular, and this low-level set of technologies helps to boost security-related aspects (and much more) before something bad actually happens.
If you have questions related to this topic, feel free to book a meeting with one of our solutions experts, mail to email@example.com.