Money isn’t the only thing at stake when systems go down. Outages can also damage brand reputation, customer trust, and overall productivity. Some businesses even say they’ve lost customers for good after long outages. This is why reducing downtime, even by just 30%, can make a significant difference, not only in revenue but also in customer loyalty and satisfaction.
I interviewed Benjamin Wilms, CEO and co-founder of Steadybit, about reducing downtime. In this article, we’ll cover how chaos engineering works, practical strategies to get started, and how Steadybit’s platform has helped companies achieve remarkable improvements in system uptime.
Chaos engineering: Key to uptime improvement
Before discussing the solution, let’s consider how much downtime costs. EMA Research’s 2022 study put the average cost of unplanned IT downtime at $12,900 per minute. In 2023, the figure rose to $14,056, with large enterprises facing costs as high as $23,750 per minute. At those rates, a single outage can cost hundreds of thousands of dollars per hour, and more than a million for large enterprises. For companies that run complex systems, like microservices or cloud-based apps, it’s crucial to keep these outages to a minimum.
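To make those per-minute figures concrete, here is a small back-of-the-envelope sketch in Python. The function names are illustrative, and the model assumes cost scales linearly with outage duration, which is a simplification rather than something the study states.

```python
# Per-minute downtime costs quoted in the article (USD).
COST_PER_MINUTE = {
    "2022 average": 12_900,
    "2023 average": 14_056,
    "2023 large enterprise": 23_750,
}


def outage_cost(cost_per_minute: float, minutes: float) -> float:
    """Estimated outage cost, assuming cost grows linearly with duration."""
    return cost_per_minute * minutes


def savings_from_reduction(cost_per_minute: float,
                           downtime_minutes: float,
                           reduction: float = 0.30) -> float:
    """Money saved if total downtime shrinks by `reduction` (0.30 = 30%)."""
    return outage_cost(cost_per_minute, downtime_minutes) * reduction


for label, rate in COST_PER_MINUTE.items():
    print(f"{label}: ${outage_cost(rate, 60):,.0f} per hour")
```

Under the same assumption, `savings_from_reduction(14_056, 100)` estimates what a 30% reduction would save a team that had 100 minutes of downtime in a quarter at the 2023 average rate.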
Chaos engineering involves introducing faults or disruptions into a system to observe its behavior under stress. This approach aims to simulate real-world failures—like server crashes, network outages, or unexpected traffic spikes—so teams can spot their system’s weak points and fix them before they cause any actual harm.
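The idea can be shown in miniature with a few lines of Python. This is a minimal in-process sketch, not how Steadybit or any other platform injects faults: the decorator, the exception class, and the two example functions are all hypothetical names invented for illustration.

```python
import random
import time
from functools import wraps


class InjectedFault(Exception):
    """Raised when the wrapper deliberately fails a call."""


def inject_faults(failure_rate=0.0, latency_seconds=0.0, rng=None):
    """Wrap a function so calls are delayed and/or fail at a chosen rate."""
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if latency_seconds > 0:
                time.sleep(latency_seconds)  # simulate a slow dependency
            if rng.random() < failure_rate:  # simulate a crashed dependency
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


def call_with_fallback(primary, fallback):
    """The resilience behavior under test: degrade instead of failing."""
    try:
        return primary()
    except InjectedFault:
        return fallback()


@inject_faults(failure_rate=1.0)  # always fail, to exercise the fallback path
def fetch_recommendations():
    return ["live-item-1", "live-item-2"]


def cached_recommendations():
    return ["cached-item"]


result = call_with_fallback(fetch_recommendations, cached_recommendations)
print(result)
```

The experiment here is the question "does the caller still return something useful when the dependency fails?" Real chaos tooling asks the same question at the infrastructure level, against hosts, containers, and networks rather than a single function.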
Chaos engineering has grown from a specialized practice into a common approach for boosting system reliability. In the past few years, companies like Netflix and Amazon have made it popular, and many others now use it as part of their resilience strategy. But as Ben points out, chaos engineering isn’t about breaking things for fun; it’s about making systems more resilient.
Ben’s company has built a platform that makes chaos engineering accessible to organizations of all sizes. One of their clients cut downtime by 30% quarter over quarter (QoQ) through continuous testing and learning from past incidents.
How Steadybit helps companies reduce downtime
Steadybit offers a user-friendly platform for chaos engineering that lets teams simulate outages, check their systems, and tackle weak spots before they cause problems. The platform works with Kubernetes and microservices-based setups so that users can run chaos tests in complex environments without much hassle.
Here’s how Steadybit has helped companies cut downtime by 30% QoQ:
- Start with historical incidents: Ben points out that teams often begin chaos engineering by examining recent downtime events, from the past week or the last quarter.
These events offer a good starting point for chaos experiments. If a service went down because of a memory leak, for example, a chaos experiment can simulate the same conditions to check whether the issue has been fully resolved. Steadybit lets teams recreate past failures and test their systems in a controlled environment.
- Manual experimentation and automation: After a team recreates a failure, they can run the experiment manually to see how the system reacts. If the system passes the test, the team can automate the experiment and add it to the CI/CD pipeline. The test then runs before every major release to confirm that the system is still resilient.
Automation plays a crucial role in cutting downtime over time. Steadybit offers API connections that let teams automate chaos experiments in their regular release processes. This ensures no critical deployment goes live without checking the system’s stability.
- Identify and address weak spots: Steadybit’s platform has features that help teams spot potential vulnerabilities, or “weak spots,” in their systems. The platform generates a landscape map that visualizes all the connections between services in a microservices architecture. This map is built using discovered data, allowing teams to see how different services interact and where potential bottlenecks or failure points might exist.
By visualizing the architecture, SREs and DevOps teams can proactively address these weak spots, often before they cause real-world downtime. Steadybit’s focus on visualization helps companies understand complex dependencies within their systems, making it easier to manage failures.
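To show why a map of service connections helps locate weak spots, here is a minimal Python sketch that ranks services by how many other services would be affected if they failed. The service names and the dependency data are invented for illustration, and this is not how Steadybit builds or analyzes its landscape map.

```python
from collections import deque

# Hypothetical service map: each service lists the services it calls.
DEPENDENCIES = {
    "web": ["checkout", "catalog"],
    "checkout": ["payments", "inventory"],
    "catalog": ["inventory"],
    "payments": [],
    "inventory": [],
}


def blast_radius(dependencies):
    """Map each service to the set of services that depend on it,
    directly or transitively."""
    # Reverse the edges: for each service, who calls it?
    dependents = {service: set() for service in dependencies}
    for service, deps in dependencies.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    radius = {}
    for service in dependents:
        affected = set()
        queue = deque([service])
        while queue:  # breadth-first walk up the caller chain
            current = queue.popleft()
            for caller in dependents.get(current, ()):
                if caller not in affected:
                    affected.add(caller)
                    queue.append(caller)
        radius[service] = affected
    return radius


ranked = sorted(blast_radius(DEPENDENCIES).items(),
                key=lambda item: len(item[1]), reverse=True)
for service, affected in ranked:
    print(f"{service}: affects {len(affected)} service(s) {sorted(affected)}")
```

In this made-up example, `inventory` has the largest blast radius because both `checkout` and `catalog` call it, which makes it a natural first target for an experiment.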
Benefits of chaos engineering with Steadybit
The outcomes speak volumes. A Steadybit customer reduced incidents linked to resilience and reliability by 30% QoQ. This improvement wasn’t a one-time fix; it came from continuous testing, learning, and iteration. Other companies that adopt chaos engineering as part of their standard operating procedures may see similar results.
Here’s a rundown of the main advantages Steadybit provides:
- Reduced downtime through proactive testing and automation.
- Improved resilience by identifying and addressing weak spots in systems.
- Collaboration across teams, making chaos engineering a shared responsibility.
- Continuous improvement with automated chaos experiments integrated into the CI/CD pipeline.
- Safe, guided onboarding that helps teams start experimenting with confidence.
How to start reducing downtime
Cutting downtime by 30% QoQ is a realistic goal, and chaos engineering with platforms like Steadybit can help you achieve it. Start by analyzing your past incidents, running manual experiments, and automating successful tests as part of your release process. Focus on collaboration, keep improving your system’s resilience, and embrace a culture of experimentation.
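As a rough sketch of that workflow, the Python below recreates a past memory-leak incident as a repeatable check and wraps it in a pass/fail gate that a CI job could run before a release. Everything here is an assumption made for illustration: the handlers stand in for a real service, the growth threshold is arbitrary, and the gate does not use Steadybit’s API.

```python
import tracemalloc
from dataclasses import dataclass
from typing import Callable, List


def memory_growth_bytes(operation: Callable[[], None],
                        iterations: int = 1000) -> int:
    """Run `operation` repeatedly and return how much traced memory grew."""
    tracemalloc.start()
    try:
        operation()  # warm-up so one-time allocations are not counted
        baseline, _ = tracemalloc.get_traced_memory()
        for _ in range(iterations):
            operation()
        current, _ = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return current - baseline


# Hypothetical handlers standing in for the service from the incident.
_leaked_buffers: List[bytes] = []


def leaky_handler() -> None:
    _leaked_buffers.append(bytes(1024))  # bug: the buffer is never released


def fixed_handler() -> None:
    buffer = bytes(1024)  # released as soon as the handler is done with it
    del buffer


@dataclass
class Experiment:
    name: str
    check: Callable[[], bool]  # True means the system stayed within limits


def run_gate(experiments: List[Experiment]) -> int:
    """Return 0 if every experiment passes, 1 otherwise (a CI exit code)."""
    failures = []
    for experiment in experiments:
        try:
            passed = experiment.check()
        except Exception:
            passed = False  # a crashing experiment counts as a failure
        print(f"{'PASS' if passed else 'FAIL'}  {experiment.name}")
        if not passed:
            failures.append(experiment.name)
    return 1 if failures else 0


GROWTH_LIMIT_BYTES = 100_000  # assumed threshold; tune for a real service

exit_code = run_gate([
    Experiment("memory stays flat after leak fix",
               lambda: memory_growth_bytes(fixed_handler) < GROWTH_LIMIT_BYTES),
])
print("exit code:", exit_code)
```

In a pipeline, the script would end with `sys.exit(exit_code)` so that a failed experiment blocks the release. Swapping `fixed_handler` for `leaky_handler` shows the gate failing, which is the regression the experiment exists to catch.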
By adopting a proactive approach to failure, you’ll minimize outages and build a more resilient infrastructure that can withstand the challenges of modern technology. Watch the interview on how Steadybit improves system resilience using powerful chaos engineering principles.