
What is a Service Mesh and why do you need it?

Probably not a surprise for anyone in the DevOps world: monoliths are becoming microservices. However, new challenges arise.

  • How do you split them up based on business domains?
  • When should you separate the data sources?
  • How do you keep their number under control?

Teams face these challenges in the development phase when building and testing applications. What about the deployment phase? Yet another solution is added to the DevOps toolkit: Service Mesh. In this article, I will explain what it is and why you might need it.

Challenges with microservices

End-users of your applications have high demands nowadays. They demand new features as soon as they hear of them, and they want them now. Companies strive for zero downtime in order to keep service outages to a minimum. However, every deployment of a new version poses a risk: things can go wrong in a lot of places. Let’s address some common challenges.

External dependencies & a high number of microservices

Microservices depend on other systems and on the teams who build and maintain them. On top of that, teams have to strive for resilient applications that can deal with infrastructure hiccups or other services failing at unexpected moments.


How do you keep track of a large number of microservices: which one depends on another, and what is the health of the overall system? Things become even more complicated when running multiple versions of the same microservice next to each other: one team requires a specific version, while another team demands new features that are only available in the latest version.

Infrastructure components & security

Every application depends on specific infrastructure components and configuration. Think of provisioning certificates and handling DNS entries for your endpoints. These things can be difficult to automate, due to technical limitations but also organizational aspects (e.g. in a heavily regulated environment). Very often the load balancer handles SSL termination, which means that service-to-service communication within your deployment environment is unencrypted.

Who does what?

These kinds of challenges have to be addressed. Most of the time, it’s the DevOps team that handles them. The next question that pops up: is this part of the application itself, or should it be part of the platform on which the application is deployed? Ideally, the DevOps team should not care about this too much, since their primary focus is to deliver new business features for a limited (business-related) scope.

A service-mesh can help you address these kinds of challenges.

Positioning of a Service Mesh

Simply speaking: a service mesh positions itself between the applications and the network. From a high-level overview, the following capabilities are offered by most service mesh vendors:

  • Load balancing, including routing traffic to the right microservice
  • Encryption of traffic flows
  • Authentication and authorization of your users and systems
  • Improved traceability and service discovery of microservices
  • On top of that, control over policies and configurations of your Kubernetes clusters

When using a microservices architecture and its deployment patterns, all of the above-mentioned challenges act as an impediment to deploying faster and more reliably. If every DevOps team handled them differently, perhaps even per microservice, things would get out of control very quickly. A service mesh can handle these issues from a centralized point of view.

Main benefits

Roughly speaking all service meshes offer the following high-level benefits:

Operational control. Access policies can be defined centrally by the platform and security teams, and DevOps teams can customize them to suit their needs. There is no need for each and every DevOps team to set them up from scratch, and configuration drift between teams and environments is avoided.
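As an illustration of such centralized access control, Istio offers an AuthorizationPolicy resource. The sketch below is a hypothetical example (the service names, namespace, and service account are made up): it allows only GET requests from one specific frontend service account.

```yaml
# Hypothetical example: only the "web" service account in the "frontend"
# namespace may issue GET requests to workloads labeled app: orders.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: orders-viewer
  namespace: orders
spec:
  selector:
    matchLabels:
      app: orders
  action: ALLOW
  rules:
    - from:
        - source:
            principals: ["cluster.local/ns/frontend/sa/web"]
      to:
        - operation:
            methods: ["GET"]
```

A platform team can own policies like this, while a DevOps team only adjusts the rules that concern its own services.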

Observability. This is the next step beyond system monitoring. Observability takes into account the context and behavior of systems instead of just raw infrastructure metrics. Microservices are not treated in isolation; it’s about the entire context. The overall system’s health is important here, since a lot of components depend on each other to function correctly.

Security. In a cloud-native world, the security perimeter is not so clear anymore. A Service mesh helps to identify and control the traffic which enters your cluster as well as the traffic which flows between your microservices (both inside the cluster as well as traffic which leaves your cluster).

Improved user experience. From a developer’s point of view, a service mesh frees the developer from managing infrastructure-related components. Their focus shifts (back) to building meaningful software. And from an end-user’s perspective, it aims to provide a smoother experience: think of handling errors in a consistent way or degrading gracefully instead of failing with a meaningless error message.


Let’s do a deeper dive into the typical features of a service mesh.


Observability

Platform operators need to be able to troubleshoot, maintain, and optimize workloads to keep them running smoothly. This is where observability comes into the picture. The following items make observability practical:

  • Monitoring and metrics. Most service mesh providers offer a so-called “mesh control plane” which shows monitoring information such as traffic, errors, latency, and saturation. Without a service mesh, these metrics are all based on individual services and there is no correlation between them. A service mesh groups services together to offer a holistic view of these metrics. This is an important aspect in the world of microservices, since an application spans multiple services which all behave differently.
  • Access logs. In one of my previous posts I wrote about Application Performance Management and provided a number of tools that can help you find performance bottlenecks. Tracing a user request from start to finish (a full transaction) is needed to pinpoint them. A service mesh can help achieve that without the need to install and configure a separate tool. Metadata about the source and destination is included as well, so auditing service behavior down to individual workload instances becomes a reality.
  • Distributed tracing. One of the hardest things about a large set of microservices is keeping a good overview of the interdependence of the microservices themselves and their external dependencies. Calls between them are difficult to trace. A service mesh generates trace spans for each service, so platform operators can track down those kinds of issues. This greatly improves the visibility of what is going on inside the cluster.


Security

End-to-end encryption. A lot of service mesh providers offer mutual TLS. Simply speaking, this means that all of your traffic is encrypted. A typical deployment pattern without mutual TLS consists of a load balancer that handles SSL offloading. All traffic up until your load balancer is secure, but the traffic inside your cluster is not. This leaves your applications vulnerable to man-in-the-middle attacks. Mutual TLS overcomes that problem. For example, with Istio you can enforce mutual TLS per Kubernetes namespace or configure it for your entire cluster as a whole.
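In Istio, for example, namespace-wide mutual TLS is enforced with a PeerAuthentication resource. A minimal sketch (the namespace name is a made-up example):

```yaml
# Require strict mutual TLS for all workloads in the "payments"
# namespace; plain-text traffic to these workloads is rejected.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: payments   # example namespace
spec:
  mtls:
    mode: STRICT
```

Applying the same resource in the istio-system namespace would enforce mutual TLS mesh-wide instead of per namespace.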

This brings us to another key characteristic of a service mesh: there is no trusted perimeter anymore. In traditional data centers, your enterprise-level firewall keeps the bad guys out and everything inside your network is treated as “trusted”. In the public cloud, this paradigm has already changed a lot, since the trusted perimeter varies from service to service. In a service mesh, the concept of a trusted perimeter is completely absent: by default, nothing is trusted anymore. Enter “zero trust platforms”.


On a zero-trust platform such as Kubernetes, workloads should be strictly isolated. It’s easy to achieve that by operating a separate cluster per application, but then your cloud bill goes up very quickly. Namespace separation is an option to logically isolate applications (if data and legal compliance rules allow it), and a service mesh can help with that. If you set up a small number of clusters per zone or region with only a single control plane per zone or region, you reap the following benefits:

  • lower costs, since you only run a single control plane
  • higher reliability, since you run the clusters in separate zones or regions

Besides reducing the number of clusters, there is more to optimize. For example, you can reduce the number of environments, like TEST and ACC, that you have to maintain. With modern deployment patterns like A/B testing and canary deployments, you don’t need to set up as many TEST environments. To run reliable tests, your TEST environment should resemble production as closely as possible. With a service mesh you can reuse an existing environment, like ACC, to run your tests in isolation. It’s even possible to utilize the PROD environment for tests, as long as you have proper separation in place so you won’t affect your production workloads. Since service meshes control the traffic flows between your microservices, this can be tuned very precisely.
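Such a canary deployment boils down to weighted traffic routing. A sketch with an Istio VirtualService (the service name and subsets are made-up examples; the v1/v2 subsets would be defined in an accompanying DestinationRule):

```yaml
# Send 90% of traffic to the stable version and 10% to the canary.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders
spec:
  hosts:
    - orders           # example service name
  http:
    - route:
        - destination:
            host: orders
            subset: v1   # stable version
          weight: 90
        - destination:
            host: orders
            subset: v2   # canary version
          weight: 10
```

Shifting more traffic to the canary is then just a matter of adjusting the weights, without touching the applications themselves.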

This greatly reduces the operational costs of keeping those environments up and running and in sync with each other.

Improved resiliency

Applications in the public cloud need to be resilient to limit the impact of failures due to latency spikes, connection timeouts, sudden failures of other services, etc. In one of my previous articles I presented a list of examples of how to achieve this. One pattern is “circuit breaking”.

A service mesh can help you implement this by setting a destination rule for the traffic that hits your application. In the destination rule for the incoming traffic you can configure parameters like the maximum number of requests per connection, the maximum number of connections, and the maximum number of pending requests. If one of those thresholds is reached, the circuit breaker kicks in, and your application needs to handle it.
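With Istio, such thresholds are expressed in a DestinationRule. The service name and the limits below are illustrative assumptions, not recommendations:

```yaml
# Trip the circuit breaker when connection or request limits are exceeded,
# and temporarily eject hosts that keep returning 5xx errors.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: orders-circuit-breaker
spec:
  host: orders            # example service name
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 10
        maxRequestsPerConnection: 1
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 60s
```

The sidecar proxies enforce these limits, so the application only has to deal with the resulting errors, not with detecting the overload itself.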

Without a service mesh, you would have to implement all of this as part of your application, which requires a different mindset and specialized knowledge. The time spent on these kinds of non-functional requirements can now be spent on building new business features that serve the business domain of the application.

Do you need it?

When viewing this list, you might ask yourself: do I really need a service mesh? There is no single yes or no answer. However, if you face one or more of the following challenges, you might consider one.

  • You run a large number of microservices which becomes ever harder to manage. DevOps teams struggle to operate them, and no one has a global overview.
  • Delivering new business features is hampered by the amount of time DevOps teams need to spend on operational aspects which should be handled by the platform.
  • Due to the large number of microservices, the attack surface grows too big for the security teams to handle. Security concerns can no longer be properly controlled.
  • Multiple DevOps teams develop a single application, which requires proper access controls.
  • You want to expand the usage of Kubernetes for your microservices and roll it out across your entire organization.
  • Teams use a variety of programming languages and frameworks to build their applications. This makes it much harder to operate infrastructure-related aspects like certificate provisioning, SSL termination, and traffic flows in a uniform way.

As you can see from the list above, deciding whether or not to consider a service mesh requires a rather abstract view that goes beyond individual applications.

Popular tools

Once you have decided you need a service mesh, here are some popular tools:

  • Istio. Perhaps the most famous service mesh. Istio uses the following phrase to emphasize its reason for existence: Connect, secure, control, and observe services. Istio supports a single cluster as well as multiple clusters, and it can be installed using its own command-line tool called istioctl.
  • Aspen Mesh. A commercial product which adds a lot of extra features on top of Istio. Its slogan is: Harness Istio without the Headaches. In other words, it promises much simpler operation compared to plain Istio. You can request a demo to see if it is valuable for your organization.
  • HashiCorp is famous for its IaC solutions like Terraform and Packer, as well as its secrets management solution, HashiCorp Vault. Its service mesh is based on Consul, and its slogan is: Create a consistent platform for modern application networking and security with identity-based authorization, L7 traffic management, and service-to-service encryption.

While Aspen Mesh can be seen as an extension to Istio, it’s also interesting to note that other companies also support or extend service meshes:

  • Prisma Cloud Compute (formerly known as Twistlock) supports service meshes for their cloud native security features. With this solution, the security of the entire mesh becomes more robust.
  • AWS introduced AWS App Mesh in late 2018 to support Amazon ECS and EKS. Its primary goal is to better run containerized microservices at scale.


Conclusion

A service mesh brings a lot of benefits if you run a large number of microservices that are hard to control. If you want to shift this operational burden away from your DevOps teams, it’s a good option to consider. Be sure to check out a number of use cases to see if it fits your situation. If it does, good luck implementing it!

