This is the final part of a series of articles attempting to lay down a road map to kickstart a potential journey into SRE. In the first part of this series, we discussed MADARR and its relationship with the DevOps principles of continual development and continual improvement. The second article introduced the concept of Observability and discussed the first stage in the journey outlined in the diagram below. The third article discussed moving your fledgling SRE practice into a pilot stage. Our fourth article investigated the architectural concept of Blue-Green deployments and the potential improvements they can bring to stabilising a platform during an upgrade cycle, and our fifth article spoke about the benefits of APM in rounding out Observability and the concept of the user as the core metric of service availability.
In this article, we will delve deeper into Chaos engineering and how it completes the journey.
Brief Recap – Or I Can’t Be Bothered To Read The Other Articles
In the first article in the series, we introduced the prime concept of SRE, that of MADARR: Measure, Analyse, Decide, Act, Reflect, and Repeat, the infinity loop of the SRE. MADARR posits that full-coverage monitoring, or total Observability, will remove the dark spots that hide unknown-unknowns. Furthermore, Observability allows Site Reliability Engineers to analyse the information gathered to better identify weaknesses in service areas.
The second post developed the concept of Observability further and explained how it differs from Monitoring. Next, we discussed the low-hanging fruit involving the use of code pipelines, which tighten the rigour surrounding IaC and application installation and configuration. Finally, we explained how load and scale consulting aids in reducing toil and improving service resilience by allowing the environment to be correctly sized for purpose and scaled for future expansion.
Our third article discussed the low-hanging fruit that makes up the building blocks in migrating an Operations team from reaction-based Operations to one based on continual improvement. There, we started to look more deeply at the procedural and structural changes to the processes that an IT department or MSP uses to deliver services to its users and clients.
Our fourth article developed the concept of Blue-Green deployments and the benefits they can bring to an architectural design in terms of stability, as well as the improvements they offer to code and feature deployments through quicker failback and safer staged rollout procedures.
The previous article in this series, the fifth, developed the concepts of scale and load further with regard to enhancing application stability and investigated the benefits of APM (Application Performance Monitoring or Management).
As we stated in the conclusion of the last article, it is time to let loose the chaos and approach nirvana at the head of your simian army.
What The Heck Is The Simian Army and Why Does It Cause Chaos?
A little bit of SRE history
Netflix, the global streaming giant, decided in 2008 to move its service to AWS due to the ever-increasing costs commensurate with its growing business. Around 2011, the greater IT community started to hear about a service Netflix had introduced to address the lack of resilience inherent in the Cloud. Netflix had found that the lack of control over the underlying infrastructure left them uncomfortable; they had no visibility, and AWS’s concept of a fault zone consisted of an entire availability zone or even a region. Greg Orzell (then an engineer at Netflix) had the idea to create a tool to test their application’s resilience by injecting faults and breakdowns into their production environment; yes, you read that right, their production environment. A service that programmatically injects random errors and outages into a live service is a scary concept. The operational engineers pushed back because their SLAs would be affected; the business attempted to push back because it could directly affect their end-users’ experience, and the business’s bottom line if those users left the service. However, it was argued, correctly, that Netflix had to move from a development model that assumed no breakdowns to one where infrastructure breakdowns were an occupational hazard, i.e. breakdowns were considered inevitable. The Netflix developers needed to consider much more than just their application; they needed to consider implications against the entire stack.
The first service Orzell introduced was the Chaos Monkey, which randomly shut down EC2 instances. Over the next several years, the Chaos Monkey was joined by several other services that programmatically introduced faults, errors, latency, downed containers, shut down racks, and caused DNS errors, growing into the Simian Army. The biggest hitters of the Simian Army were the Chaos Gorilla, which simulated the failure of an availability zone, and Chaos Kong, which effectively dropped an entire AWS region. The concept has proved so successful that Chaos has become a field of engineering in its own right.
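To make the idea concrete, here is a minimal, hypothetical Python sketch of a Chaos Monkey-style fault injector. It is emphatically not Netflix's implementation: the 'chaos-opt-in' tag, the dry-run guard, and the choice to stop rather than terminate instances are assumptions made for illustration, and it presumes boto3 plus AWS credentials with the relevant EC2 permissions.

```python
"""Hypothetical sketch of a Chaos Monkey-style fault injector.

Illustrates the idea only: randomly stop one EC2 instance from a pool
that has explicitly opted in to chaos testing via an assumed
'chaos-opt-in' tag. Requires boto3 and AWS credentials.
"""
import random
import boto3


def pick_victim(ec2):
    """Return the ID of one random running, opted-in instance (or None)."""
    response = ec2.describe_instances(
        Filters=[
            {"Name": "tag:chaos-opt-in", "Values": ["true"]},  # assumed opt-in tag
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    instance_ids = [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]
    return random.choice(instance_ids) if instance_ids else None


def unleash_monkey(dry_run=True):
    ec2 = boto3.client("ec2")
    victim = pick_victim(ec2)
    if victim is None:
        print("No opted-in instances found; the monkey goes hungry.")
        return
    if dry_run:
        print(f"[dry run] Would stop {victim}")
    else:
        # Stop rather than terminate so the experiment is easily reversible.
        ec2.stop_instances(InstanceIds=[victim])
        print(f"Chaos Monkey stopped {victim}")


if __name__ == "__main__":
    unleash_monkey(dry_run=True)  # flip to False only when you really mean it
```

Even in a toy like this, the opt-in tag and the dry-run default reflect the lesson Netflix learned: chaos is injected deliberately and with guard rails, not by accident.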
How do we introduce Chaos to drive our move to Nirvana?
The introduction of Chaos into an environment is not a trivial task. For example, consider this situation: a team member accidentally powers off a customer’s core line-of-business application, costing that customer several thousand (insert your currency of choice) in losses.
The chances are that person will end up fired without a reference, and the rest of the team will curse their name until the sun goes nova because of the extra work they have caused. This finger-pointing is one of the main reasons for a zero-blame wash-up after any incident. Why are we re-affirming the notion of blameless wash-ups? Because with Chaos, you are deliberately inserting a potential customer-affecting outage into a system. Yes, the reason is to improve the service by identifying weaknesses; however, it is potentially end-user affecting and could result in claims for reparations against your company by angry customers. Introducing Chaos involves a journey that encompasses not just your company but potentially any company that you have dealings with. Chaos is the nirvana of SRE functionality, a process that potentially elevates your operational teams to a zen-like existence where a fault is exposed in a service and fixed without interruption to the end users’ experience. This potentially means a need for a symbiotic relationship with end-users and customers who accept some disruption during testing.
If you have been following the previous articles in this series, you should be able to see how the path we have been travelling has led to this point. Observability, infrastructure and configuration management as code, pipelines, the introduction of zero-blame incident wash-ups, and moving service targets from SLAs to targets based on SLIs and SLOs have all been stepping stones on the road to Chaos.
Earlier it was stated that Chaos has become an engineering discipline; this is true. Three of the four main hyperscalers provide services that allow the introduction of Chaos engineering onto their platforms: the Azure platform has Chaos Studio, AWS has the Fault Injection Simulator, and OCI has MAA. GCP is the outlier here; that said, there are plenty of third-party open-source offerings (Chaos Toolkit and Litmus) and even some commercial products (Gremlin) in the marketplace that can fill that gap.
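As an illustration of how thin the glue code can be when a hyperscaler does the heavy lifting, the sketch below starts a pre-built AWS Fault Injection Simulator experiment from Python via boto3 and polls until it finishes. The experiment template itself (for example, one that stops a percentage of tagged instances) is assumed to exist already, and the template ID shown is a placeholder, not a real value.

```python
"""Hypothetical sketch: starting a pre-defined AWS FIS experiment via boto3.

Assumes an experiment template has already been created in the AWS console
or via IaC; the template ID passed in below is a placeholder.
"""
import time
import boto3

fis = boto3.client("fis")


def run_experiment(template_id):
    # Start the experiment from the existing template.
    experiment = fis.start_experiment(experimentTemplateId=template_id)["experiment"]
    experiment_id = experiment["id"]
    print(f"Started FIS experiment {experiment_id}")

    # Poll until the experiment reaches a terminal state.
    while True:
        state = fis.get_experiment(id=experiment_id)["experiment"]["state"]
        print(f"Status: {state['status']}")
        if state["status"] in ("completed", "stopped", "failed"):
            break
        time.sleep(30)


if __name__ == "__main__":
    run_experiment("EXT-EXAMPLE-TEMPLATE-ID")  # placeholder template ID
```

The point is that the platform service owns the blast radius, rollback, and stop conditions; your code merely triggers and observes the experiment.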
Chaos engineering aims to increase the reliability and stability of the code deployment behind an application and the infrastructure supporting it; therefore, it is apparent that Chaos engineering should be part of the development and testing process. This statement makes sense; why? Because code should be written from the perspective that the underlying infrastructure will fail. OK, but what about when an application moves into production? Once the program has left the safe and sheltered development environment and entered the big bad world, it runs on infrastructure shared with other services, infrastructure that is largely hidden from you when deployed on a cloud. The best way forward is to start small. That is correct: start with the low-hanging fruit, the areas you are sure of, such as web interfaces and clustered databases, as sketched below. This way, you build up trust in the application, and your users and customers will gain confidence in the concepts of Chaos engineering.
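As a concrete example of starting small, here is a hedged sketch of a steady-state probe: while a fault is injected elsewhere (say, stopping one node of a clustered database), it hammers a web interface's health endpoint and reports the error rate an end user would have experienced. The URL, probe count, and one-percent error budget are illustrative placeholders, not recommendations.

```python
"""Minimal sketch of a 'start small' steady-state probe.

While a fault is injected elsewhere (e.g. one node of a clustered database
is stopped), this script repeatedly calls a web interface's health endpoint
and reports how many requests an end user would have seen fail.
"""
import time
import requests

HEALTH_URL = "https://app.example.com/health"  # placeholder endpoint
PROBES = 100
ERROR_BUDGET = 0.01  # tolerate at most 1% failed probes during the experiment


def measure_error_rate():
    failures = 0
    for _ in range(PROBES):
        try:
            response = requests.get(HEALTH_URL, timeout=2)
            if response.status_code >= 500:
                failures += 1
        except requests.RequestException:
            failures += 1
        time.sleep(0.5)
    return failures / PROBES


if __name__ == "__main__":
    error_rate = measure_error_rate()
    verdict = "within" if error_rate <= ERROR_BUDGET else "outside"
    print(f"Observed error rate {error_rate:.1%}, {verdict} the error budget")
```

If the observed error rate stays within the budget while the fault is active, you have evidence of resilience; if not, you have found a weakness in a controlled way rather than during a real outage.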
Conclusion
Well, we have finally reached the end of our journey. There has been a lot to consume in this series, and one of the interesting things is how the concepts of SRE dovetail so nicely with those of DevOps. Principles like the standardisation of deployment through code and using pipelines to streamline service deployments and remove human intervention are all DevOps ideas that aid service stability. The SRE takes the idea of blameless wash-ups after deployments and moves it up the food chain into incident management, introducing a robust environment for self-reflection and service improvement. Deep Observability into all aspects of a service or system, combining traditional monitoring tooling with log and event aggregation of not just the infrastructure but also the application stack, gives a 360-degree view of what is good, allowing the SRE to infer where issues are or could be. Finally, Chaos engineering, the most potent weapon in the toolbox of the Site Reliability Engineer, allows the SRE to test the inferences gained via Observability.
If you have questions related to this topic, feel free to book a meeting with one of our solutions experts, or mail sales@amazic.com.