It’s no exaggeration to say that nearly every company is moving its applications to the cloud. Experts at those companies agree that you can only reap the benefits of cloud technology with modern architectures, so the advice is to use cloud native technologies as much as possible. Soon, however, the number of applications and services reaches a tipping point. You lose control over them: no one knows what runs where or how well these applications perform. This is where observability comes into play. In this article we will explore the key aspects of observability and how your business can benefit from it.
Setting the baseline
Every DevOps team already knows how to measure critical infrastructure-related metrics such as CPU load, memory usage and I/O. These all serve to keep an application running in technical terms. Not every organization, however, defines business-related KPIs that measure the (potential) business impact of an application. Measuring the above-mentioned metrics becomes increasingly complex in distributed systems such as those running on containers or serverless infrastructures. Teams require a better understanding of how well their applications are performing. In addition, when problems occur, they need to find the root cause as quickly as possible.
Here is where observability finds its place in the ecosystem of software development processes. The term originates from control theory, where it describes how well so-called self-regulating systems can be understood. Now this is applied to the performance of distributed systems as well. Observability uses three types of data: logs, metrics and traces. It combines the gathered information to find problems and to improve the performance of the system as a whole.
Comparison with monitoring
Traditional monitoring systems aimed at cloud native solutions are not sufficient to gather this kind of information. Often, they fail to capture all of the events, communication paths and other aspects needed to trace an issue back to its origin. Without this complete overview, there is no way to find the exact root cause, since there might be several hundred or even thousands of processes in place. Such monitoring is also not sufficient to measure performance, since it won’t measure a request from end to end.
Monitoring focuses on a single application, whereas observability looks at the entire chain of (inter)dependencies and the architecture of the given application.
Why it is important
It’s important to have the full spectrum in place to optimize an application, for a number of reasons:
- It fills a gap in monitoring complex systems; monitoring alone simply can’t handle these kinds of distributed systems.
- It captures the high number of issues caused by the frequent changes that are happening.
- The issues are much more complex; observability also answers the need to monitor unpredictable systems.
Apart from the initial investment (more on that later), there are several benefits for the vast majority of stakeholders in the company. Generic benefits are better alerting, better workflows and higher developer velocity. How does this translate to the key stakeholders involved?
For the business owners
Business owners benefit because observability frees developers from a lot of trial and error when deploying new features. Developers can release faster and with more confidence, and they know up front how a change will impact the stability and performance of the system. Being better informed helps to keep downtime to a minimum. Business owners strive for happy customers at all times, and observability helps to achieve that goal.
For DevOps teams
Developers participating in DevOps teams win because they get better insight into their entire architecture. They have to spend less time figuring things out and tracking and tracing issues across the entire infrastructure landscape. This reduces friction and frustration, since debugging and bug fixing are not the most popular activities for them. Besides, they have to attend fewer meetings with other teams, so they can focus more on their core activity: programming great software.
For the entire department
Since observability provides a great overview of the entire system, operators, project managers, analysts and other subject matter experts benefit as well. In a large system with a lot of internal dependencies, the communication overhead shrinks. Project sponsors will love this, since they can choose to invest their money in other projects that bring more revenue to the company.
As pointed out at the beginning of this article, you require proper insights into your logs, metrics and traces to have a good overview of your entire application infrastructure. If you measure all of these individually using separate solutions, you do not achieve the key benefits of observability.
What do you need?
So you need a tool to combine logs, metrics and traces. Let’s break these items down:
Logs give you a first indication when something goes wrong. They consist of text records that each describe an event at a particular point in time. Each record contains at least a timestamp, the component that emitted it (identified by a unique ID) and an error and/or description of what happened. Logs can be provided as plain text, in a binary format or as a structured file. The last option is easy to query, so it acts as a good source for your observability tool.
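To make the idea of a structured, queryable log record concrete, here is a minimal sketch in Python. The field names (timestamp, component_id, level, message) and the component IDs are assumptions chosen for illustration; real logging pipelines define their own schemas.

```python
import json
from datetime import datetime, timezone


def make_log_record(component_id: str, level: str, message: str) -> str:
    """Serialize one event as a single JSON line (hypothetical field names)."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "component_id": component_id,
        "level": level,
        "message": message,
    }
    return json.dumps(record)


def errors_for_component(lines: list[str], component_id: str) -> list[dict]:
    """Query structured logs: keep ERROR records emitted by one component."""
    records = [json.loads(line) for line in lines]
    return [
        r for r in records
        if r["component_id"] == component_id and r["level"] == "ERROR"
    ]


# Three example events from two made-up components.
log_lines = [
    make_log_record("checkout-7f3a", "INFO", "order received"),
    make_log_record("checkout-7f3a", "ERROR", "payment gateway timeout"),
    make_log_record("catalog-19bc", "ERROR", "image not found"),
]

checkout_errors = errors_for_component(log_lines, "checkout-7f3a")
print([r["message"] for r in checkout_errors])
```

Because every record is structured, filtering on a component or severity is a simple lookup instead of a fragile text search.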
Metrics are structured by default and provide valuable information over longer periods of time. A metric is a numeric value that measures a specific item over time. Think of a business KPI or the number of recurring paying visitors of a website.
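As a rough illustration of a metric as a numeric value measured over time, the sketch below averages raw samples per time window. The sample values, the one-minute window and the function name are assumptions for this example only.

```python
from collections import defaultdict
from statistics import mean

# (seconds since start, measured value) samples for one hypothetical metric,
# for example the response time of a service in milliseconds.
samples = [(0, 120.0), (20, 130.0), (65, 300.0), (90, 100.0), (130, 110.0)]


def average_per_window(samples, window_seconds=60):
    """Group samples into fixed time windows and average each window."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        window_start = (timestamp // window_seconds) * window_seconds
        buckets[window_start].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


per_minute = average_per_window(samples)
print(per_minute)  # {0: 125.0, 60: 200.0, 120: 110.0}
```

Aggregating like this is what lets a metric show a trend over hours or days without storing every single measurement.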
Traces are pieces of information that follow a request from start to end through the entire distributed system. They can (in fact: should) be uniquely identified. In addition, they should contain important metadata such as the microservice (or other infrastructure component, such as a serverless function) that processes the request.
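The following sketch shows the core idea behind a trace: one unique identifier that every component attaches to its own record of the request. The Span class, its fields and the service names are hypothetical; frameworks such as OpenTelemetry provide their own, richer data model.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Span:
    """One step of a request, recorded by the component that handled it."""
    trace_id: str    # shared by every span that belongs to the same request
    service: str     # metadata: which component processed this step
    operation: str
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def handle_request(services: list[str]) -> list[Span]:
    """Simulate one request passing through several components in order."""
    trace_id = uuid.uuid4().hex  # generated once, at the entry point
    return [
        Span(trace_id=trace_id, service=name, operation="handle")
        for name in services
    ]


spans = handle_request(["api-gateway", "order-service", "invoice-function"])
for span in spans:
    print(span.trace_id, span.span_id, span.service)
```

All three spans print the same trace ID but different span IDs, which is what makes it possible to reassemble the full path of a request afterwards.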
To successfully implement observability, you need to integrate the above-mentioned aspects using a proper tool. To create an observability system, be sure to include the following key aspects:
- Instrumentation: to collect the correct information from all of the infrastructure elements.
- Data correlation: process the gathered information to visualize and correlate it.
- Incident responses: make sure to dispatch the disruptive issues to the appropriate users.
- AIOps: aggregate, correlate and prioritize the large volume of data into meaningful information based on the system’s impact, criticality, etc.
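The data correlation aspect from the list above can be sketched as follows: log entries and trace spans are grouped by a shared trace ID, so that an error message can be linked to the component that was slow. All field names, IDs and durations are made up for illustration, and a real observability tool does this at a far larger scale.

```python
from collections import defaultdict

# Hypothetical log entries and trace spans that share a trace ID.
logs = [
    {"trace_id": "t1", "level": "INFO", "message": "order received"},
    {"trace_id": "t2", "level": "ERROR", "message": "payment timeout"},
    {"trace_id": "t2", "level": "INFO", "message": "retry scheduled"},
]
spans = [
    {"trace_id": "t1", "service": "order-service", "duration_ms": 40},
    {"trace_id": "t2", "service": "order-service", "duration_ms": 45},
    {"trace_id": "t2", "service": "payment-service", "duration_ms": 5000},
]


def correlate(logs, spans):
    """Group logs and spans that belong to the same request."""
    by_trace = defaultdict(lambda: {"logs": [], "spans": []})
    for entry in logs:
        by_trace[entry["trace_id"]]["logs"].append(entry)
    for span in spans:
        by_trace[span["trace_id"]]["spans"].append(span)
    return dict(by_trace)


def failing_traces(correlated):
    """For every trace with an ERROR log, report the service of its slowest span."""
    report = {}
    for trace_id, data in correlated.items():
        has_error = any(e["level"] == "ERROR" for e in data["logs"])
        if has_error and data["spans"]:
            slowest = max(data["spans"], key=lambda s: s["duration_ms"])
            report[trace_id] = slowest["service"]
    return report


correlated = correlate(logs, spans)
print(failing_traces(correlated))  # {'t2': 'payment-service'}
```

Neither data source alone gives this answer: the logs say that something failed, while the spans say where the time was spent.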
Criteria to select a tool
With these prerequisites in mind, the following criteria are a top priority when selecting a tool or building one yourself.
Integration is king here. If your solution does not integrate the above-mentioned key aspects, your investment is worthless. So you need to integrate with your existing platforms (container, serverless, messaging), frameworks and (cloud) environments.
Collecting real-time data is crucial, since outdated data is useless for deciding on the appropriate action. Therefore, your solution needs to support modern event-handling techniques and use APIs to automatically gather the data in real time. It is no surprise that you should include the proper context of everything you collect. Without it, you don’t know much about the data and you can’t properly visualize and correlate it.
Anomalies should be detected automatically, so it’s important to use a fast and self-learning solution. Machine learning can help here. It’s impossible to do this manually, since the amount of data is simply too large.
User-friendly dashboards that help management focus their efforts and that give a proper overview of what goes wrong on multiple levels are very important. The more business value you get out of them, the better. Since your solution has an impact on a lot of people in the organization, it should be user friendly and relatively easy to adopt. Otherwise it won’t fit into your existing processes and it quickly loses the interest of the key sponsors.
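To give a feel for automatic anomaly detection, here is a deliberately simple statistical sketch. It uses a z-score, which is not machine learning but a common baseline, and the latency values and threshold are assumptions for this example. Self-learning tools use far more sophisticated models.

```python
from statistics import mean, stdev


def find_anomalies(values, threshold=2.0):
    """Return (index, value) pairs that lie more than `threshold`
    standard deviations away from the mean of the series."""
    if len(values) < 2:
        return []  # a standard deviation needs at least two samples
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [
        (i, v) for i, v in enumerate(values)
        if abs(v - mu) / sigma > threshold
    ]


# Hypothetical response times in milliseconds; the last one is an outlier.
# The threshold is kept low because a small sample limits how large a
# z-score can become.
latencies_ms = [101, 99, 100, 102, 98, 100, 101, 99, 100, 500]

anomalies = find_anomalies(latencies_ms)
print(anomalies)  # [(9, 500)]
```

Even this basic approach shows why automation matters: the same check can run continuously over every metric, which no team can do by hand.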
Practical starting points
If you got excited about what observability can do for you and your DevOps team(s), it’s good to know where to start.
From a developer perspective, it’s best to gain practical experience by trying out frameworks, toolsets and other technical building blocks. For example, browse through the documentation and getting-started guides of solutions that match the criteria above.
Be sure to check out the OpenTelemetry website. It contains a great overview of the features of this observability framework for cloud-native software applications. Since it integrates with a large number of popular frameworks, programming languages and database systems, it targets a wide variety of developers.
From an organizational point of view, it’s best to prepare every key stakeholder in the organization:
- Inform and educate them about what observability really is. Explain the differences between monitoring and observability.
- Consult the current sponsors of the existing tools: the tooling might change in the future to address observability-related requirements.
- Evaluate or extend the current KPIs and see if they need to be adjusted in the light of observability.
- Seek sponsors: you need a team to implement the proper tools and processes. Besides, you need someone to pay for the tool licenses or for building the system yourself.
- Do a thorough evaluation of what is already offered by your cloud providers. Perhaps you can start small and extend later on.
- Build a proper business case to evaluate the business benefits against the initial investment. Remember, IT is not a cost center anymore, but you do require a business case to convince the right people in the organization.
- Get everyone enthusiastic with practical demos, given by yourself and/or by other companies that have already walked this path.
Observability is critically important to keep a good overview of your distributed systems, such as those made up of a huge number of containers. Collect, process and analyze the combination of logs, metrics and traces in a single set of tools. Learn from the data you gather to detect anomalies.
This helps your developers make informed decisions about the next change to release. It answers questions such as what will happen to the stability of the system, and it helps to pinpoint the root causes of issues that arise. In the end, you can further speed up your software development processes, but only after an initial investment, so it is important to take your sponsors into account.
If you have more questions related to this topic, feel free to book a meeting with one of our solutions experts or send an email to firstname.lastname@example.org.