This is the fifth part of a series of articles attempting to lay down a road map to kickstart a potential journey into SRE. In the first part of this series, we discussed MADARR and its relationship with the DevOps principles of continual development and continual improvement. The second article introduced the concept of Observability and discussed the first stage in the journey outlined in the diagram below. The third article discussed moving your fledgling SRE practice into a pilot stage. Our fourth article investigated the architectural concept of Blue-Green and the potential improvements it could bring to stabilising a platform during an upgrade cycle.
In this article, we will delve deeper into load and scale implementation.
Brief Recap – or I can’t be bothered to read the other articles.
In the first article in the series, we introduced the prime concept of SRE: MADARR. Measure, Analyse, Decide, Act, Reflect, and Repeat is the infinity loop of the SRE. MADARR posits that complete monitoring coverage, or total Observability, will remove the dark spots that hide unknown-unknowns. Furthermore, Observability allows Site Reliability Engineers to analyse the information gathered to better identify weaknesses in service areas.
The second post developed the concept of Observability further and explained how it differs from Monitoring. Next, we discussed the low-hanging fruit involving the use of code pipelines, which tighten the rigour surrounding IaC and application installation and configuration. Finally, we explained how load and scale consulting aids in reducing toil and improving service resilience by allowing the environment to be correctly sized for purpose and scaled for future expansion.
Our third article discussed the low-hanging fruit that makes up the building blocks of migrating an Operations team from reaction-based Operations to one based on continual improvement. There we started to look more deeply at procedural and structural changes to the processes that an IT department or MSP uses to deliver services to its users and clients.
Our fourth article developed the concept of Blue-Green Deployments and the benefits they can bring to an architectural design from the standpoint of stability, as well as the improvements they bring to code and feature deployments in terms of quicker failback and safer staged rollout procedures.
Cloud is all about linear scaling, or elasticity. The ability to expand and contract your compute, storage or network capacity at will, or as need dictates, is core to that concept. This flexibility is one of the traditional drivers for Cloud, be that public cloud on the hyper-scalers, private cloud in-house, or hybrid across both public cloud and private datacentres.
If you have followed the series and implemented some or all of the recommendations, you are well on the SRE journey. Your deployments have been coded and, therefore, standardised. You may have increased your Observability through Monitoring and log centralisation. Further, you have broken down walls to increase collaboration between previously siloed teams, flattening communication structures. You have covered a lot of ground, and you are starting to act and think like an SRE.
Building out a Load and Scale Implementation
As we have already mentioned, elasticity is a core feature of Cloud computing. All three hyper-scalers have the concept of autoscaling groups: the ability to grow a compute platform to handle traffic peaks and contract it as the traffic mellows out into a trough. These are based on metrics drawn from your Monitoring, for example, CPU and/or memory utilisation, thread counts, disk usage or queue length. These static metrics can give a semblance of resilience and scale to an environment; however, they often miss performance issues caused at the application level.

Another problem is purely architectural; by default, the hyper-scalers build out autoscaling groups based on a set number of nodes and a single metric, CPU utilisation. For example, on Azure, the default scale-out rule is “node CPU average at 75% for 10 minutes”, at which point an extra node is powered on. There is inherent wastage and delay built into this system. Application performance will most likely already have degraded across the system well before a node’s CPU utilisation reaches a constant 75% or greater. More worrying, however, is that the scale-in policy will not destroy a node until CPU utilisation reaches 25% or lower. If servers have been correctly right-sized, they would be running at a much higher average than 25%; therefore, the scale-in may never happen. From both a cost and a stability point of view, this is quite wasteful.
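To make that dead zone concrete, here is a minimal, purely illustrative Python sketch of the kind of single-metric threshold policy described above. The 75% scale-out and 25% scale-in values mirror the Azure defaults quoted; the function and variable names are hypothetical and not taken from any cloud SDK.

```python
from statistics import mean

# Hypothetical illustration of a default CPU-threshold scaling policy.
# The 75% / 25% thresholds mirror the Azure defaults quoted above;
# all names here are invented for the sketch.
SCALE_OUT_THRESHOLD = 75.0   # average CPU % that adds a node
SCALE_IN_THRESHOLD = 25.0    # average CPU % that removes a node

def scaling_decision(cpu_samples_last_10_min: list[float], node_count: int) -> int:
    """Return the new node count based purely on average CPU utilisation."""
    avg_cpu = mean(cpu_samples_last_10_min)
    if avg_cpu >= SCALE_OUT_THRESHOLD:
        return node_count + 1          # scale out, but only after sustained saturation
    if avg_cpu <= SCALE_IN_THRESHOLD and node_count > 1:
        return node_count - 1          # scale in, but right-sized nodes rarely drop this low
    return node_count                  # the 25-75% "dead zone": nothing happens

# A right-sized fleet averaging ~55% CPU never triggers either rule,
# even if application latency has already degraded.
print(scaling_decision([55.0, 58.0, 60.0], node_count=4))  # -> 4
```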
So how do we make elasticity more responsive and flexible?
A single auto-scale trigger based on CPU is very limiting; other metrics, memory utilisation for example, can often constrain an application, causing an outage or degradation in service. Understanding how to scale an application based on different metrics is where things get a little more complicated; AWS and Azure require additional configuration and resources outside of a standard deployment to use alternative triggers. Azure, for example, requires the deployment of Application Insights, whilst AWS involves the configuration of a customised metric in CloudWatch. This, in turn, leads to greater complexity in your designs and a greater risk of stability issues due to misconfiguration. Unless, of course, your environments are deployed using IaC.
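As a rough illustration of the AWS route mentioned above, the sketch below publishes an application-level measurement to CloudWatch as a custom metric using boto3, so that a scaling policy could reference it instead of CPU. The namespace, metric name, region and value are all hypothetical, and the scaling policy itself is not shown.

```python
import boto3

# Hypothetical sketch: push an application-level metric (here, requests
# waiting in an internal queue) to CloudWatch as a custom metric.
# The namespace and metric name are invented for illustration; a
# target-tracking or step-scaling policy would then reference this
# metric rather than CPU utilisation.
cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

def publish_queue_depth(queue_depth: int) -> None:
    cloudwatch.put_metric_data(
        Namespace="MyApp/Scaling",               # hypothetical namespace
        MetricData=[
            {
                "MetricName": "PendingRequests",  # hypothetical metric name
                "Value": float(queue_depth),
                "Unit": "Count",
            }
        ],
    )

publish_queue_depth(42)
```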
OK, now we can be much more flexible with our trigger metrics, but our response is still heavily manual; how can we remove more toil from our day-to-day process?
Rounding out Observability with Application Performance Monitoring (APM)
As already stated in previous articles in this series, Observability is key to everything an SRE does; all SRE decisions are knowledge-based. We now have complete visibility into our infrastructure; we have, however, only partial visibility into our application and UI stack. For example, we may have Monitoring on databases and other legacy monolith applications like SAP or SAS. But complete end-to-end visibility will only be gained with the implementation of APM tooling, such as Datadog, New Relic or LogicMonitor APM.
So what does APM give us in terms of our Observability? According to Cisco, “Application performance management, or APM, is the act of managing the overall performance of software applications to monitor availability, transaction times, and performance issues that could potentially impact the user experience.”
Although often used interchangeably, the terms Application Performance Management and Application Performance Monitoring are different beasts. A significant distinction is that Application Performance Monitoring typically refers to the capability to see the data and metrics behind various components of an application. In contrast, Application Performance Management refers to the practices and processes of decision-making and changes based on data observations made within the application workflow.
Some of the core features of an APM tool are:
- Synthetic Monitoring: Synthetics are simulated transactions that mimic an end user’s experience within a website or application. These allow an SRE to identify what “good” looks like, which will feed into the application’s SLIs (Service Level Indicators) and SLOs (Service Level Objectives) and act as an early warning indicator for performance degradation.
- Real-User Monitoring: RUM is similar to Synthetic Monitoring but provides real-time insight based on real user interactions rather than simulated ones.
In the real world, Synthetic Monitoring will be used to benchmark an application and establish the value of what is “good”. Real-User Monitoring will then aid troubleshooting, as live results can be checked against the synthetic baseline to show whether performance is holding steady or degrading.
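To make the idea of a synthetic transaction concrete, here is a minimal sketch of a scripted probe using the Python requests library. The URL, latency budget and checks are all hypothetical, and a commercial APM tool would run far richer, browser-level transactions than this.

```python
import time
import requests

# Hypothetical synthetic check: exercise one user journey step and
# compare the observed latency against an SLO-derived budget.
TARGET_URL = "https://example.com/checkout"   # hypothetical endpoint
LATENCY_BUDGET_SECONDS = 0.5                  # hypothetical "good" threshold from the SLO

def run_synthetic_check() -> dict:
    start = time.monotonic()
    response = requests.get(TARGET_URL, timeout=5)
    elapsed = time.monotonic() - start
    return {
        "status_code": response.status_code,
        "latency_seconds": round(elapsed, 3),
        "within_slo": response.ok and elapsed <= LATENCY_BUDGET_SECONDS,
    }

if __name__ == "__main__":
    print(run_synthetic_check())   # feed this result into your monitoring pipeline
```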
The metrics that APM tooling produces can be polled to obtain real-time indicators of application performance that are much more granular and focused; these can then be fed into your autoscaling algorithms to improve your scaling reactivity. A fully functional APM solution completely wraps up the observability question; now you can really see the meteor storm and the stars.
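As a sketch of what feeding an APM metric into a scaling decision might look like, the example below polls a hypothetical APM metrics endpoint for p95 latency and converts it into a desired node count. The endpoint, the SLO target and the simple latency-to-capacity rule are all invented for illustration; real APM products expose their own query APIs and autoscalers their own hooks.

```python
import requests

# Hypothetical sketch: poll an APM tool's metrics API for p95 latency
# and translate it into a desired node count. Everything here (URL,
# target, thresholds) is made up for illustration.
APM_METRIC_URL = "https://apm.example.com/api/metrics/p95_latency_ms"  # hypothetical
LATENCY_TARGET_MS = 300.0   # hypothetical SLO-derived target

def desired_node_count(current_nodes: int) -> int:
    p95_ms = float(requests.get(APM_METRIC_URL, timeout=5).json()["value"])
    if p95_ms > LATENCY_TARGET_MS * 1.2:               # users already feel the slowdown
        return current_nodes + 1
    if p95_ms < LATENCY_TARGET_MS * 0.5 and current_nodes > 2:
        return current_nodes - 1
    return current_nodes
```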
Another function of APM is the ability to trace an application’s flow from the user down to its components. A trace is the process of tracking a transaction within an application as different parts of the application respond to it.
In complex, distributed applications, multiple components are typically involved in each transaction. For example, the user submits a request on the application frontend to start their transaction, which in turn triggers a request to the database to retrieve the data needed to handle it. From there, the application processes the data and sends the result back to the user on the frontend.
By tracing these transactions as they move between the various parts of the application, the SRE team can gain granular visibility into application performance. Instead of simply knowing that application latency is high, which, incidentally, is something that APM can tell you, tracing lets the engineer pinpoint which part of the application, whether it be the frontend, the database, the business logic, or something else, is the smoking gun causing the latency or other performance problem. The majority of modern APM products utilise tracing data behind the scenes to connect the dots and provide causal relationships and dependencies, but they rarely offer the ability to inspect each transaction in detail. From that point of view, distributed tracing can be described as a building block for APM, which itself is more a description of a use case.
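For a feel of what instrumenting such a transaction looks like in code, here is a minimal sketch using OpenTelemetry, a vendor-neutral tracing API used here purely as a representative example rather than any of the APM products named above (it assumes the opentelemetry-api and opentelemetry-sdk packages are installed). The span names mirror the hypothetical frontend, database and business-logic steps described in the paragraph above.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Minimal tracing sketch: nested spans for the frontend request, the
# database query and the business logic, so the resulting trace shows
# where the time in a transaction is actually spent.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-demo")   # hypothetical service name

def handle_transaction() -> None:
    with tracer.start_as_current_span("frontend-request"):
        with tracer.start_as_current_span("database-query"):
            pass   # fetch the data needed for the transaction
        with tracer.start_as_current_span("business-logic"):
            pass   # process the data before returning it to the user

handle_transaction()   # each span's duration pinpoints the slow component
```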
Conclusion
We are approaching the end of our SRE journey. In this post, we looked at further increasing application stability by using scale and load balancing, coupled with autoscaling groups that utilise alternative triggers, such as memory, to grow and shrink an application scale set. We looked at how to close out the observability journey by adding an APM tool that offers deep insight into an application’s performance, from the end user down to the lowest functional component. We also discussed utilising APM metrics to add greater flexibility to a scale set’s triggers. In our next post, we will start to close the circle and deliver enlightenment. We will begin to look at the concept of AIOps and how it can improve stability through automated triggers, and also look at the higher-level concepts of Chaos Engineering and the Simian Army.
If you have questions related to this topic, feel free to book a meeting with one of our solutions experts or email sales@amazic.com.