Several buzzwords have become prominent enough in conversation that they have gained traction with recruiters. One that seems to have particularly piqued their interest is SRE, to the point that I now receive three or four calls a week about the subject and potential roles. Should I care? And if I do care, how do I become a Site Reliability Engineer, or how do I start to introduce the concepts of SRE into my current operational role? Before we move on to the bigger picture, let us first attempt to define SRE.
But what exactly is SRE?
The term SRE was coined by Ben Treynor, a Google engineer, in 2003 (yes, that’s right, almost 20 years ago, Google was using SREs to manage their availability, resilience and performance). The practice was later documented in the book Site Reliability Engineering, published in 2016. Treynor defined SRE as “what happens when you ask a software engineer to design an operations team.” Naturally, this can be a very frightening thought if you are an Operations person.
That said, eloquent as the quote above is, it gives little insight into what SRE actually is. SRE can be broken down into two primary concepts: firstly, the processes and procedures of “SRE”, or “Site Reliability Engineering”; and secondly, the individual, the “SRE” or “Site Reliability Engineer”, who is tasked with carrying out the roles and responsibilities.
What is SRE the Process?
The concept of SRE rose from the ashes of the Great Wizarding War of the late 1990s.
We are talking about the fallout from the first attempt at hyperscaling ASP and the dot-com bubble. At that point, companies like Google, Netflix and Amazon realised they could not scale to the sizes needed to provide global coverage with a traditional human-centric model; it was uneconomical in terms of cost, flexibility and agility. So the great and the good at these companies started to investigate ways to increase application stability by reducing application failures and infrastructure outages, and to improve scalability without increasing management overheads.
They started to look in depth at where and when most errors and outages occurred, and at how to prevent a similar failure from causing a service outage again; this is the crux of SRE. It is an end-user-focused view of Operations, where “user availability” is the only API; it is not predicated on a traditional SLA based on the mythical five 9s. Five 9s means 99.999% availability, which equates to just over eight-tenths of a second of outage a day (about 0.86 seconds).
As already mentioned, “five 9s” is focused on how long an outage can last. It is not focused on preventing the outage: engineering around weak spots and single points of failure, and deploying applications and infrastructure that can survive the loss of a constituent part or, at the scale of Google, AWS and Netflix, an entire data centre or service region. But we get ahead of ourselves.
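To make the five 9s arithmetic concrete, here is a minimal sketch of the calculation. The function name and its parameters are my own, chosen for illustration; the only inputs are the availability percentage and the length of the period.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def allowed_downtime_seconds(availability_percent: float,
                             period_seconds: float = SECONDS_PER_DAY) -> float:
    """Return the downtime an availability target permits within a period.

    Hypothetical helper for illustration: 99.999 means "five 9s".
    """
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return period_seconds * unavailable_fraction


if __name__ == "__main__":
    # Compare a few common targets over one day and over a 365-day year.
    for target in (99.0, 99.9, 99.99, 99.999):
        per_day = allowed_downtime_seconds(target)
        per_year = allowed_downtime_seconds(target, 365 * SECONDS_PER_DAY)
        print(f"{target}%: {per_day:.3f} s/day, {per_year / 60:.2f} min/year")
```

Running it for 99.999% gives roughly 0.864 seconds a day, which is where the "just over eight-tenths of a second" figure comes from. Note that the number only tells you how much outage is tolerated, not how to avoid it.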
It was this thinking at the leading hyperscalers that led to the rise of the concept of MADARR: Measure, Analyse, Decide, Act, Reflect and Repeat.
For those familiar with DevOps, the above will look very similar in concept to the DevOps infinity loop of Plan, Code, Build, Test, Release, Operate, Monitor and Feedback. You would be correct, as SRE and DevOps are two sides of the same coin. One cannot exist without the other; they are symbiotic.
They share the common themes of cross-functional collaboration, zero-blame failure wash-ups, constant feedback and continual improvement. These concepts are built into the DNA of DevOps and SRE policies.
Who is SRE’s Target Audience?
SRE is the opposite side of the DevOps coin; it is more focused on the traditional Operations functions of resilience and keeping the lights on. However, it is embedded across the entire business. It focuses on architecture: building infrastructure and applications that can survive outages, and introducing concepts such as Blue/Green deployments. SRE is also embedded in change and update management, again constantly looking for ways to improve stability and automate out human error. Finally, SRE can be found in Operations and Build teams, where it takes on the DevOps processes of IaC (Infrastructure as Code) and CaC (Configuration as Code) to move business as usual to a new level of stability. Here SREs use Chaos Engineering to deliberately introduce failures into systems, finding the weak points so that service availability to end-users can be stabilised and improved. They manage regular full end-to-end incident tests to find and fix weaknesses in processes. And they chair zero-blame incident wash-ups, whose only purpose is to find the root cause, not to apportion blame.
Building out your SRE Practice
You are probably sitting there now thinking, how do I get to this Nirvana of SRE? Remember, “Rome was not built in a day”. So start small, identify your major pain points and focus on those; think small, progressive wins.
The first key to moving to SRE-focused Operations is to understand your environment fully; if you remember the maxim of MADARR introduced earlier in this article, you will recognise that the first constituent of the term is Measure. In reality, that means observability, or visibility over everything. The chances are your company is already monitoring your environment: watching dashboards, hopefully noting what goes amber or red, and then reacting to the outage before your customer or end-user is aware something is wrong. However, it is more likely that the first time you are aware of an issue is when the support desk phones start ringing, and you are already on the back foot, reacting to the problem rather than managing it.
However, if you are in the former category and actively monitoring your environment, you are well on the way to the first stage. Measuring the environment is key to understanding the environment.
“Sounds great”, I hear you shout, “We are already doing that!” But are you? Do you “really” monitor your environment? Do you analyse the inputs into your monitoring tools? Are you even collecting the metrics that you really need to monitor? Or are you monitoring for failure or audit (blame), not interception or stability? Or, more to the point, do you even look at the logs and events before you are in the middle of an incident, looking for that smoking gun?
“Situational awareness” is the statement of the day here. Until you know what is happening, where your pain points are, and where your weaknesses are, you cannot even begin to improve.
Without situational awareness, you cannot decide where to focus your efforts for improving service.
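As a small illustration of turning measurements into situational awareness, the sketch below ranks services by the share of user requests that succeeded, so the weakest one surfaces first. Everything here is hypothetical: the `RequestRecord` type, the service names and the idea that your monitoring tool can export one record per request are assumptions for the example, not features of any particular product.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class RequestRecord:
    """One user-facing request as seen by monitoring (hypothetical shape)."""
    service: str
    succeeded: bool


def availability_by_service(records: Iterable[RequestRecord]) -> Dict[str, float]:
    """Return the fraction of successful requests for each service."""
    counts: Dict[str, Tuple[int, int]] = {}  # service -> (successes, total)
    for record in records:
        ok, total = counts.get(record.service, (0, 0))
        counts[record.service] = (ok + int(record.succeeded), total + 1)
    return {service: ok / total for service, (ok, total) in counts.items()}


def weakest_first(records: Iterable[RequestRecord]) -> List[Tuple[str, float]]:
    """Rank services from lowest to highest availability."""
    return sorted(availability_by_service(records).items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Made-up sample data, purely to show the shape of the output.
    sample = [
        RequestRecord("checkout", True),
        RequestRecord("checkout", False),
        RequestRecord("search", True),
        RequestRecord("search", True),
    ]
    for service, availability in weakest_first(sample):
        print(f"{service}: {availability:.1%} of requests succeeded")
```

The point of the design is that the measure is taken from the end-user's side (did the request succeed?) rather than from the health of an individual server, which matches the user-focused view described earlier.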
Choosing your first step in the journey – Making your Changes
OK, you have been monitoring your environment, analysing your outputs, and finding the low-hanging fruit. But how do you implement the improvements? This, my friends, is the hidden little secret of SRE, and what keeps the engineers in bed at weekends and evenings: when improving something, if the process or function cannot be rolled back, go back to the drawing board and re-engineer it until it can be. For changes that resist this, isolate the offending piece or process and break it into as small a set of pieces as possible, each of which you can deploy in isolation. If something breaks, your blast area will be minimal, and you will already have identified precisely where the problem is.
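The idea of small, reversible pieces can be sketched in a few lines. This is only an illustration under stated assumptions: `ChangeStep` and `run_change` are names I have invented, each step is assumed to come with its own rollback, and a step that fails is assumed to leave nothing behind that needs undoing.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class ChangeStep:
    """One small, independently deployable piece of a change (hypothetical)."""
    name: str
    apply: Callable[[], None]
    rollback: Callable[[], None]


def run_change(steps: Sequence[ChangeStep]) -> Optional[str]:
    """Apply steps in order; on failure, undo the applied ones in reverse.

    Returns None if every step succeeded, otherwise the name of the step
    that failed, which tells you exactly where the problem is.
    """
    applied: List[ChangeStep] = []
    for step in steps:
        try:
            step.apply()
        except Exception:
            # Assumption: the failed step itself left no partial state behind.
            for done in reversed(applied):
                done.rollback()
            return step.name
        applied.append(step)
    return None


if __name__ == "__main__":
    deployed: List[str] = []

    def broken_step() -> None:
        raise RuntimeError("simulated deployment failure")

    steps = [
        ChangeStep("add-config", lambda: deployed.append("config"),
                   lambda: deployed.remove("config")),
        ChangeStep("restart-service", broken_step, lambda: None),
    ]
    print("failed step:", run_change(steps))
    print("state after rollback:", deployed)
```

Because each step is small and carries its own rollback, a failure names a single step and the environment returns to its previous state, which is the minimal blast area described above.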
Don’t forget your post mortem
A post-mortem, or wash-up, after you have finished any change or implementation is particularly important, and even more so if the change or deployment has had to be rolled back: “move and fail fast”, or “fail fast, learn faster”. The object of the post-mortem is to review the change and the processes followed, considering what could have been done better and how. Finally, remember to document what has changed and the lessons learned in a formal wash-up document. This protects the environment from that nasty little bus or train that everybody seems to fall under. Also make sure that your as-built documentation is updated to reflect the changes to the deployed environment.
Rinse and Repeat
The final and often forgotten part of SRE is the Repeat part of the acronym. This does not mean repeating the same mistakes. Rather, it means repeating the cycle to improve the process by implementing the lessons learnt in the wash-up. It also means choosing another process or implementation to put through the cycle.
Conclusion and Next Steps
For those aware of the concepts of DevOps and its processes, the SRE concept of MADARR will be familiar, because DevOps and SRE are opposite sides of the same coin and complement each other in improving stability and the flow of work through a complex system.
Both are heavily influenced by Lean Manufacturing principles and the concepts of continual development and continual improvement. Thank you for reading this introduction to SRE. In the next article in this series, we will start to look at the tooling needed to properly implement the first stages, Measure and Analyse.
If you have questions related to this topic, feel free to book a meeting with one of our solutions experts by emailing email@example.com.