This is the fourth part of a series of articles attempting to lay down a road map to kickstart a potential journey into SRE. In the first part of this series, we discussed MADARR and its relationship with the DevOps principles of continual development and continual improvement. The second article introduced the concept of Observability and discussed the first stage in the journey outlined in the diagram below. The third article discussed moving your fledgling SRE practice into a pilot stage.
In this article, we move on from the easy, low-hanging fruit and look at how we can further mature our SRE chops to improve service resilience and remove toil from the operational day.
Brief Recap – or I can’t be bothered to read the other articles.
In the first article in the series, we introduced the prime concept of SRE: MADARR. Measure, Analyse, Decide, Act, Reflect, and Repeat is the infinity loop of the SRE. MADARR posits that full-coverage monitoring, or total Observability, will remove the dark spots that hide unknown-unknowns. Furthermore, Observability allows Site Reliability Engineers to analyse the information gathered to better identify weaknesses in service areas.
The second post developed the concept of Observability further and explained how it differs from monitoring. Next, we discussed the low-hanging fruit of code pipelines, which tighten the rigour surrounding IaC and application installation and configuration. Finally, we explained how load and scale consulting helps reduce toil and improve service resilience by allowing the environment to be correctly sized for purpose and scaled for future expansion.
Our third article discussed the low-hanging fruit that makes up the building blocks for migrating an Operations team from reaction-based operations to one based on continual improvement. Here we start to look more deeply at procedural and structural changes to the processes that an IT department or MSP uses to deliver service to their users and clients.
Production and moving into the realm of SRE Nirvana – SRE and Design Improvements
In this article, we will look at the potential for architecture-driven improvements to service stability.
Architectural and Deployment Improvements
In traditional IT Operations departments, an often-heard phrase is “if it ain’t broke, don’t fix it.” However, taken at face value, that statement means never looking at a service again once it has been operationalised to see if it can be improved, until such time as it fails. This phrase is anathema to the SRE, for whom the concepts of continual improvement and service availability are front and centre.
If you remember the SRE tenet that every action should work towards reducing toil and improving service availability, improving your design capability by integrating your SRE team into your Architecture team makes perfect sense. The ability to gain insights into improving the architectural design of infrastructure and applications to enhance service availability is a powerful tool.
Architectural concepts like Blue/Green deployments are a powerful driver of zero-downtime deployments. Blue/Green refers to an architecture where a second, mirror-image copy of a platform is deployed to an alternative availability zone or region and protected by an overarching load-balancer construct. Blue/Green may sound like an expensive way to deploy infrastructure and applications. However, it does mean that a development team has a fully featured and fully sized environment in which to undergo large-scale regression testing. Further, it opens up some fascinating deployment strategies that improve service availability, but more on these later.
The diagram above shows a straightforward Blue/Green deployment consisting of a simple application or website built with an autoscaling group; it has been deployed to two separate resource groups and front-ended with a load balancer that offloads the SSL. One deployed environment, termed “Blue”, is the current live environment; the load balancer directs all traffic to the “Blue” segment. The second environment, termed “Green”, sits there waiting for traffic that will never arrive.
The ability to automatically fail over to a fully independent site with a simple rebalancing of the load balancer’s rules is a powerful construct, but it is not the most potent aspect of a Blue/Green deployment.
Blue/Green introduces other interesting capabilities, such as automatic rollback when an issue arises during deployment; this can be achieved with a simple rebalancing of the rules attached to the load balancer.
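The cut-over and rollback described above can be sketched in a few lines of Python. This is a minimal illustration, not a real load-balancer API; the `Environment` and `LoadBalancer` names are hypothetical:

```python
# Minimal sketch of Blue/Green traffic switching. All names here are
# illustrative assumptions, not any particular cloud provider's API.
from dataclasses import dataclass


@dataclass
class Environment:
    name: str      # "blue" or "green"
    version: str   # application release deployed in this environment


class LoadBalancer:
    """Routes 100% of traffic to exactly one environment at a time."""

    def __init__(self, live: Environment, standby: Environment):
        self.live = live
        self.standby = standby

    def swap(self) -> None:
        # Cut over: the standby becomes live, the old live becomes
        # the hot failback. Only the routing rules change.
        self.live, self.standby = self.standby, self.live

    def rollback(self) -> None:
        # Rollback is simply the inverse swap -- nothing is redeployed.
        self.swap()


lb = LoadBalancer(live=Environment("blue", "v1.0"),
                  standby=Environment("green", "v1.1"))
lb.swap()                       # release v1.1: "Green" takes all traffic
assert lb.live.name == "green"
lb.rollback()                   # issue found: "Blue" serves v1.0 again
assert lb.live.name == "blue"
```

The key point is that both the cut-over and the rollback are the same cheap operation: no infrastructure is rebuilt, only the load balancer's routing rules change.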
What does a traditional deployment strategy look like? The most basic strategy is a simple throw-it-over-the-wall push-out of a new application: a big bang. This results either in the pure delight of a successful deployment followed by a party, or in the much more likely outcome: chaos, users complaining of issues, and lost sleep and weekends. Yes, it has the advantage of being cheap, but the risks associated with this form of deployment are many: a stressful working environment for the staff rolling out the deployment and a loss of prestige for the company delivering the update. It also lacks a simple and efficient rollback method. Therefore, a basic deployment strategy should not be used for anything but a simple change. So what could be a better methodology?
The rolling deployment is a strategy that updates running instances of an application with the new release. All the application endpoints in the target environment are updated to the latest service over a set period of time, in set batch sizes. This method is simpler to roll back due to the smaller batch size; thus, it is less risky than a basic deployment. Beyond managing the deployment batch size, the implementation is simple and very similar to a basic deployment. It does, however, increase service cost, as both the new and old versions must be supported for a longer period. In addition, by design, deployments are slow due to the need to verify each batch for issues. As can be seen, both basic and rolling implementations are fraught with stability and service-availability issues.
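The batching logic of a rolling deployment can be sketched as follows. The `rolling_deploy` and `health_check` names are illustrative assumptions; a real implementation would probe actual health endpoints and drain traffic between batches:

```python
# Sketch of a rolling deployment: update instances in fixed-size batches,
# verifying each batch before moving on. Names are illustrative only.


def health_check(instance: str) -> bool:
    # Placeholder: in reality, probe the instance's health endpoint.
    return True


def rolling_deploy(instances: list[str], new_version: str,
                   batch_size: int = 2) -> dict[str, str]:
    """Return the version each instance ends up running."""
    deployed: dict[str, str] = {}
    for i in range(0, len(instances), batch_size):
        batch = instances[i:i + batch_size]
        for instance in batch:
            deployed[instance] = new_version  # push the new release
        if not all(health_check(inst) for inst in batch):
            # A failed batch halts the rollout; earlier batches stay on
            # the new version, so old and new versions now coexist --
            # exactly the mixed-version risk described above.
            raise RuntimeError(f"batch starting at index {i} failed checks")
    return deployed


result = rolling_deploy(["app-1", "app-2", "app-3", "app-4", "app-5"], "v2.0")
assert all(version == "v2.0" for version in result.values())
```

Note how the verification step after every batch is what makes the rollout slow by design, and why both versions must remain supported until the final batch completes.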
So how can we make it better?
Blue/Green! Remember that concept? The Blue/Green deployment is a strategy that utilises our two identical environments: one termed “Blue” (aka production) and one termed “Green” (aka staging). Testing is carried out in the “Green” environment. Once all quality, scaling, and user acceptance testing has been completed, the production release is a simple swap of traffic priorities, with “Blue” becoming the hot failback and “Green” becoming the new production. After a period in which the newly released application has operated correctly in production, the “Blue” environment can be upgraded to the same release level as the new production “Green”, and the cycle repeats.
Some of the more significant benefits of the Blue/Green deployment are that it is simple, fast, well understood, and easy to implement. Rollback is also straightforward, because you can flip traffic back to the old environment in case of any issues. This is a significant advantage for the SRE, as these deployments are not as risky as other deployment strategies. The major downside to Blue/Green is cost; replicating a production environment can be complex and expensive, especially when working with microservices. Further, quality assurance and user acceptance testing may not identify every anomaly, and, just as with the basic deployment methodology, shifting all user traffic at once presents risks. Any outage or issue caused by the deployment could have a wide-scale business impact before a rollback is triggered. Depending on the type of implementation, any in-flight user transactions may be lost when traffic is rolled back.
Setting aside the cost implications of a full Blue/Green deployment strategy, it does allow application deployments to move to the next level. For example, the canary deployment is a strategy that releases an application or service incrementally to a subset of users. Fundamentally it is similar in concept to the rolling deployment, but because it is carried out on the “staging” side of the Blue/Green infrastructure, the ability to update in small phases (e.g., 2%, 25%, 75%, 100%) is coupled with a high-speed rollback capability. Thanks to this control, a canary release carries the lowest risk of all the deployment strategies. In addition, a canary deployment allows an organisation to test in production with real users and use cases and to compare different service versions side by side. Again, however, there are risks, as a canary deployment effectively involves testing in production. Furthermore, there is an overhead in managing or automating the release due to the more complex testing and verification processes required.
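The phased canary rollout described above might be sketched like this. The phase percentages and error budget are assumptions for illustration, and `error_rate` stands in for real Observability telemetry (e.g., the 5xx ratio on the canary):

```python
# Sketch of a phased canary rollout with automatic rollback.
# PHASES and ERROR_BUDGET are illustrative assumptions.

PHASES = [2, 25, 75, 100]   # percentage of traffic shifted to the canary
ERROR_BUDGET = 0.01         # abort if more than 1% of canary requests fail


def canary_rollout(error_rate) -> int:
    """Return the final canary traffic percentage (0 means rolled back).

    `error_rate` is a callable taking the current phase percentage and
    returning the observed failure ratio -- a stand-in for telemetry.
    """
    for percent in PHASES:
        # Shift `percent` of traffic to the canary, then check telemetry
        # before widening the blast radius any further.
        if error_rate(percent) > ERROR_BUDGET:
            return 0        # high-speed rollback: all traffic back to stable
    return 100              # canary promoted to full production


# A healthy release climbs through every phase; a bad one is rolled back
# while it is still only serving a small slice of users.
assert canary_rollout(lambda pct: 0.0) == 100
assert canary_rollout(lambda pct: 0.05) == 0
```

The small early phases are what make the canary the lowest-risk strategy: a bad release is caught while it affects only a tiny fraction of users, and rollback is the same cheap traffic flip as in the Blue/Green swap.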
A/B testing is similar in concept to a canary deployment: different versions of the same service run simultaneously as “experiments” in the same environment for a set period of time. Several methods can be used to control these experiments, such as feature-flag toggling and A/B testing tools. The experiment owner is responsible for defining how user traffic is routed to each experiment and version of an application. Commonly, user traffic is routed based on specific rules on a load balancer or on user demographics, and it is used to perform measurements and comparisons between service versions. Once the A/B testing has been completed successfully, the target environments can be updated with the optimal service version.
Although A/B testing is not a deployment strategy per se but rather an advanced testing strategy focused on experimentation and exploration, new-feature testing, and so on, it is a valid use case that can be brought to fruition with the implementation of a Blue/Green architecture.
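One common way to implement the experiment routing mentioned above is to hash a stable user identifier into a bucket, so that the same user always lands in the same experiment. The sketch below is an assumption about one possible approach, not a description of any particular A/B testing tool:

```python
# Sketch of sticky A/B experiment routing via a stable hash of the user ID.
# The split percentage and function names are illustrative assumptions.
import hashlib


def assign_variant(user_id: str, b_percent: int = 50) -> str:
    """Route a user to experiment 'A' or 'B' based on a stable hash."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100   # deterministic bucket in [0, 100)
    return "B" if bucket < b_percent else "A"


# The assignment is sticky: the same user always sees the same variant,
# which keeps the experiment's measurements consistent.
assert assign_variant("alice") == assign_variant("alice")
assert assign_variant("alice", b_percent=100) == "B"   # all traffic to B
assert assign_variant("alice", b_percent=0) == "A"     # all traffic to A
```

Because the bucket is derived from the user ID rather than chosen at random per request, the experiment owner can dial `b_percent` up or down without users flickering between versions mid-session.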
With this post, we introduced the concept of embedding your SRE team into your Architecture team to elevate the design process to consider service resilience. Further, we introduced the idea of Blue/Green deployments, together with the potential deployment and testing improvements that implementing such a strategy brings to an application or service. In our next post, we will move on to productionising load and scale implementation.
If you have questions related to this topic, feel free to book a meeting with one of our solutions experts, mail to email@example.com.