In the first post in this series, we discussed SRE and why it is integral to modern operations. If you remember the principle of MADARR that we introduced in that article, you will recognise that it is predicated on Observability. Without correct and comprehensive information, an engineer tasked with making informed decisions will be hobbled. It goes without saying that the budgets available to AWS, Google and the other hyperscalers are not available to most companies; so how do mere mortals like us scale our observability capability so that we can adequately Measure and Analyse?
This article focuses on your first steps in building your SRE capability; however, before we start, we need to set some guidelines. What is suitable for a hyperscaler like GCP or AWS is not necessarily reasonable or logical for a sub-$50M-a-year company, so we need to define our company size. The vast majority of companies are small to medium-sized, but what shall we use as our “mythical” company?
Gartner, the business research and advisory firm, recommends defining small businesses as companies with fewer than 100 employees or less than $50 million in annual revenue. According to Gartner’s guidelines, medium-sized businesses have between 100 and 999 employees, or between $50 million and $1 billion in revenue. This is a broad brush, especially for the medium-sized company, but it is a definition we can work with. Our company will therefore have approximately 500 employees and generate roughly $500M a year in revenue, which places it near the midpoint of a Gartner-defined medium business. It is a retail business with a heavy presence in bricks and mortar and a growing online presence across multiple clouds such as Azure and AWS.
Now that we know this information, we have our guide rails; and we are in a reasonable position to make some assumptions about our monitoring requirements. But wait a minute, I thought we were talking about Observability; why mention Monitoring?
So what is it, Observability or Monitoring?
At the most simplistic level, there appears to be no difference between Monitoring and Observability because both words have similar definitions. However, with regard to SRE requirements, the best definitions I have found come from DevOps Research and Assessment (DORA), which states that:
Monitoring is tooling or a technical solution that allows teams to watch and understand the state of their systems. Monitoring is based on gathering predefined sets of metrics or logs.
Observability is tooling or a technical solution allowing teams to actively debug their system. Observability is based on exploring properties and patterns not defined in advance.
In layman’s terms, Monitoring is looking at what you know for errors, and Observability is looking at the complete picture and inferring issues from patterns. Or, to use Donald Rumsfeld’s famous term, Observability allows the discovery of things previously unknown, those “unknown unknowns”: things we did not know about simply because we did not know to look for them.
To an SRE, Observability is the crux of everything. So how do we move from a monitoring paradigm to one of Observability?
How to Implement Monitoring and Observability
At a bare minimum, all Monitoring and Observability solutions are designed to do the following five things:
- Provide leading indicators of an outage or service degradation.
- Detect outages, service degradations, bugs, and unauthorised activity.
- Help debug outages, service degradations, bugs, and unauthorised activity.
- Identify long-term trends for capacity planning and business purposes.
- Expose unexpected side effects of changes or added functionality.
As with all DevOps and SRE capabilities, installing a tool is not enough to achieve the defined objectives. Tools can help or hinder progress towards your desired endpoint, depending on how they are configured.
One of the main issues with traditional Monitoring systems is that their outputs are confined to a single individual or team within an organisation. For example, the legacy tool CiscoWorks may be deployed in the Network team and System Center Operations Manager (SCOM, from Microsoft) in the WinTel and Lintel teams, while the Storage team has its own tooling to monitor its various arrays. This situation leads to islands of Observability: silo boundaries limit the information available to groups outside the silo, which is not conducive to situational awareness at a global or holistic level.
It is said that knowledge is power, and this adage is particularly true of the output of monitoring tools. Making traditional developers and DevOps Engineers proficient at understanding the results of Monitoring tooling, logs and performance metrics is hugely empowering. This change helps to develop a culture of data-driven decision-making and dramatically improves the ability to debug the overall system, which leads to the desired endpoint of reducing outages.
Here are a few keys to effective implementation of Monitoring and Observability.
First, your Monitoring tooling should tell you what is broken and help you understand why before too much damage is done.
The key metric in the event of an outage or service degradation is time-to-restore (TTR).
A key contributor to reducing TTR is knowing what broke and the quickest path to restoring the errant service; this may not involve remediating the underlying problem, but rather inserting a workaround to restore the service pending a full fix.
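As a minimal sketch of how TTR might be measured, the snippet below computes time-to-restore from a list of incident records. The record layout, service names and timestamps are hypothetical; in practice they would come from your ticketing or paging system. Note that the clock stops when service is restored, not when the underlying problem is fully fixed.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (service, detected_at, restored_at).
INCIDENTS = [
    ("checkout", datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 9, 45)),
    ("checkout", datetime(2024, 3, 9, 13, 10), datetime(2024, 3, 9, 13, 25)),
    ("search", datetime(2024, 3, 12, 17, 0), datetime(2024, 3, 12, 19, 0)),
]


def time_to_restore(detected_at, restored_at):
    """Time between detection and restoration of service (not the full fix)."""
    return restored_at - detected_at


def mean_ttr_minutes(incidents):
    """Average TTR in minutes across a list of incident records."""
    durations = [
        time_to_restore(detected, restored).total_seconds() / 60
        for _service, detected, restored in incidents
    ]
    return mean(durations)
```

Tracking this figure per service over time shows whether workarounds and better visibility are actually shortening outages.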
There are two common ways of looking at a system:
- Blackbox monitoring, where the system’s internal state and mechanisms are not made known,
- Whitebox monitoring, where they are.
With Blackbox monitoring, you can only observe the metrics the service vendor has chosen to expose. Blackbox monitoring tends to be based on sampling against a published API. You see what you are allowed to see, so your logic tree must infer your pass and fail standard based on regular sampling and response codes.
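The sampling approach described above could be sketched as follows, using only the Python standard library. The latency threshold, the pass/warn/fail labels and the example URL are assumptions for illustration, not part of any vendor's tooling.

```python
import time
import urllib.error
import urllib.request


def classify(status_code, latency_s, max_latency_s=1.0):
    """Infer pass/warn/fail from the only things a black-box probe can see."""
    if status_code is None:
        return "fail"  # no response at all
    if status_code >= 500:
        return "fail"  # server-side error
    if status_code >= 400:
        return "warn"  # client error: the probe itself may be misconfigured
    if latency_s > max_latency_s:
        return "warn"  # service is up, but slow
    return "pass"


def probe(url, timeout_s=5.0):
    """Sample an endpoint once and return (status_code, latency_s)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code  # the server answered with an error status
    except OSError:
        status = None  # DNS failure, refused connection, timeout, etc.
    return status, time.monotonic() - start


# Example use against a hypothetical health endpoint, run on a schedule:
#   print(classify(*probe("https://example.com/health")))
```

Because the probe sees only status codes and timings, it can tell you that something is wrong, but rarely why.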
Whitebox monitoring relies on information emitted by the workload itself, for example Metrics, Logs and Traces, which is either stored on the device for consumption or forwarded to a central location such as a Syslog server.
A little Side Knowledge – the Difference between Logs, Metrics and Traces
Logs and Metrics are often confused, but they are different things. Logs are timestamped records of individual events; they show how the system behaved before an outage and are invaluable when looking for the smoking gun, but they are expensive to store. Metrics are numeric snapshots of the performance of a service, either processed in real time or collected on a schedule, and can be used as an indicator of good end-user performance. Finally, a Trace is a map of the lifecycle of a request, following its passage through the entire system. One downside is that trace output is even more verbose than that of a Log, but traces are a very effective measure of performance and help pinpoint where a failure occurred.
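To make the distinction concrete, here is a small, self-contained sketch in which a single request produces all three signals. The in-memory lists, the metric name and the `load_cart` step are hypothetical stand-ins for a real telemetry backend and real work.

```python
import json
import time
import uuid
from collections import Counter
from contextlib import contextmanager

METRICS = Counter()  # metric: a cheap numeric aggregate
LOG_LINES = []       # log: one record per discrete event
SPANS = []           # trace: timed steps that share one trace_id


@contextmanager
def span(trace_id, name):
    """Record how long a named step took as part of a trace."""
    start = time.monotonic()
    try:
        yield
    finally:
        SPANS.append({
            "trace_id": trace_id,
            "name": name,
            "duration_ms": (time.monotonic() - start) * 1000,
        })


def handle_request(path):
    """Serve one request and emit a metric, a log line and trace spans."""
    trace_id = uuid.uuid4().hex
    with span(trace_id, "handle_request"):
        with span(trace_id, "load_cart"):
            pass  # placeholder for real work
    METRICS[f"requests_total{{path={path}}}"] += 1
    LOG_LINES.append(json.dumps({
        "ts": time.time(),
        "level": "INFO",
        "msg": "request served",
        "path": path,
        "trace_id": trace_id,
    }))
    return trace_id
```

The metric answers "how many?", the log line answers "what happened?", and the spans answer "where did the time go?". Carrying the same `trace_id` in the log line is what lets you jump from one signal to another.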
Now that we understand the difference between Observability, Monitoring and Tracing, let’s investigate our SME’s tooling options; you might be surprised.
SRE Observability Tools
As we have already stated, an SRE requires tooling focused on Observability to diagnose issues quickly and conclusively without setting up specific monitors up front. Observability captures enough data in enough detail to answer questions you didn’t think of beforehand, as opposed to Monitoring, which is designed to answer questions that you already know you need answered.
Observability tools provide insight into an application’s performance and show how well it runs from a user’s perspective; this is the only true metric, as the user is the customer. These tools also collect strategic analytics that align the app’s continuous performance with SLAs (Service Level Agreements) and other operational guidelines, such as an SLO (Service Level Objective) and an SLI (Service Level Indicator).
Knowledge Side Bar: SLAs as an indicator of service performance have started to receive some bad press. For example, if an SLA states that a service must be recovered within 24 hours, and the service is not recovered for 36 hours because the service provider spent 24 hours waiting for information from the client, has the SLA been breached, or are there still 12 hours left until it is?
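The ambiguity in that example comes down to whether the SLA clock pauses while the provider waits on the client. A minimal sketch, assuming waiting periods are recorded as non-overlapping intervals inside the incident window (the timestamps are hypothetical), shows both readings:

```python
from datetime import datetime, timedelta


def sla_elapsed(opened_at, restored_at, client_waits, pause_on_client=True):
    """Time counted against the SLA between opening and restoration.

    client_waits is a list of (start, end) intervals spent waiting on the
    client; they are assumed not to overlap and to lie inside the incident.
    """
    total = restored_at - opened_at
    if not pause_on_client:
        return total
    paused = sum((end - start for start, end in client_waits), timedelta())
    return total - paused


# The scenario from the sidebar: 36 hours to recover, 24 of them waiting.
OPENED = datetime(2024, 1, 1, 0, 0)
RESTORED = OPENED + timedelta(hours=36)
CLIENT_WAITS = [(OPENED + timedelta(hours=6), OPENED + timedelta(hours=30))]
SLA_LIMIT = timedelta(hours=24)

with_pause = sla_elapsed(OPENED, RESTORED, CLIENT_WAITS)            # 12 hours
without_pause = sla_elapsed(OPENED, RESTORED, CLIENT_WAITS, False)  # 36 hours
```

With the clock paused, 12 hours are counted and the SLA is met with 12 hours to spare; without the pause, it is breached by 12 hours. Unless the agreement states which rule applies, both parties can claim to be right.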
For SRE, the pillars of Observability are:
- Large amounts (“high-cardinality”) of rich (“high-dimensionality”) data points about software behaviour
- Log messages emitted from software components as they execute
- Request traces that trace the execution of transactions across sets of services.
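One way to picture why rich data points matter: if every event carries many fields, you can slice by any of them after the fact, including ones you never built a monitor for. The sketch below uses hypothetical events and field names and is not tied to any vendor's query language.

```python
from collections import defaultdict

# Hypothetical "wide" events: one record per request, many fields each.
EVENTS = [
    {"region": "eu-west", "app_version": "2.4.0", "status": 200},
    {"region": "eu-west", "app_version": "2.4.1", "status": 500},
    {"region": "us-east", "app_version": "2.4.0", "status": 200},
    {"region": "us-east", "app_version": "2.4.1", "status": 503},
    {"region": "eu-west", "app_version": "2.4.0", "status": 200},
    {"region": "us-east", "app_version": "2.4.0", "status": 200},
]


def error_rate_by(events, field):
    """Group events by an arbitrary field and return the 5xx rate per group."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in events:
        key = event.get(field, "<missing>")
        totals[key] += 1
        if event["status"] >= 500:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}
```

In this made-up data, grouping by `region` shows the same error rate everywhere and explains nothing, while grouping by `app_version` isolates every failure to version 2.4.1. That second question did not need to be anticipated when the events were recorded.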
Many vendors offer observability tooling; to name a few: Dynatrace, AppDynamics, DataDog, New Relic, Honeycomb and LogicMonitor. However, they have one common theme, and that is price, with costs typically based on the number of devices monitored. Depending on the size of the infrastructure and associated services, they can become very expensive very quickly. So how can we start on a budget?
As we now know, the key to Observability is visibility: if you gather the correct information, you can make the correct decisions, and if you follow the correct trails, you can infer answers from your data. This does not mean buying an expensive APM at the start. The image below shows a three-phase implementation plan.
For the remainder of this article we will concentrate on Stage 1 as outlined in the above diagram.
Stage one: make your deployments easier and repeatable; this means pipelines to stabilise deployments. We are assuming that our company is already deploying infrastructure and applications into AWS and/or Azure using Terraform and Ansible, which means that our mythical company’s deployments are standardised and repeatable. However, without a pipeline engine, chances are that those deployments are still fractured and manual. Examples of pipeline engines include CircleCI and GitHub Actions for a platform-agnostic approach, or Azure DevOps and AWS CodePipeline for a cloud-opinionated approach. Alternatively, you could go hardcore and install a Jenkins platform from scratch.
A fully functional pipeline management process is one of the easiest ways to increase reliability. Through the repeatable processes that a pipeline enables, deployment errors are reduced, with a corresponding increase in stability. A practical side benefit is a commensurate reduction in toil, or unnecessary work. Better still, this is low-hanging fruit that delivers outsized returns on your reliability.
The second part of Stage one is load and scale consulting, or as I like to call it, proper capacity planning and forecasting. This means gathering usage graphs over time and inferring growth patterns so that you can better forecast compute, networking, storage or even personnel requirements. It is a massive driver of stability, as issues caused by a lack of resources will be mitigated. Are your compute resources right-sized on creation? What is your plan for growth? Do you know when your peak resource needs occur, such as the 9am, 1pm and 5pm boot storms, or the month-, quarter- and year-end traffic surges? Even looking at these at a high level and defining a plan will be a massive improvement in stability.
To sum up this post, it is safe to say that any improvement to your processes that increases reliability is a good thing, and there is no requirement at the start of your journey for a massive investment in tooling. Stage one alone is full of low-hanging fruit: things that people have become lazy about given the “cheapness” of compute resources since the introduction of virtualization technology, and the perception that virtualization is zero cost. By re-learning the art of right-sizing through assessment and capacity planning, an IT department’s operations and design teams can remove issues caused by a lack of necessary resources.
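As a rough illustration of forecasting from usage history, the sketch below fits a straight line to monthly peak usage and estimates how many months remain before a capacity limit is reached. The sample figures and the assumption of linear growth are hypothetical; real workloads with seasonal surges would need a richer model.

```python
def fit_line(ys):
    """Least-squares fit of y = slope * x + intercept for x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def months_until_full(monthly_peak_usage, capacity):
    """Months after the latest sample until the trend line reaches capacity.

    Returns None when usage is flat or shrinking.
    """
    slope, intercept = fit_line(monthly_peak_usage)
    if slope <= 0:
        return None
    x_full = (capacity - intercept) / slope
    latest_x = len(monthly_peak_usage) - 1
    return max(0.0, x_full - latest_x)


# Hypothetical storage peaks in TB for the last six months, 100 TB installed.
STORAGE_TB = [40, 44, 48, 52, 56, 60]
remaining = months_until_full(STORAGE_TB, capacity=100)
```

With these figures, usage grows by 4 TB a month, so roughly ten months of headroom remain. Even a crude estimate like this turns "we ran out of disk" from a surprise into a planned purchase.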
In the next article, we will start to look a little higher up the tree and investigate what can be done to start to infer where issues could arise and how to mitigate them through re-engaging with developers, architects and DevOps to re-engineer platforms for stability.