DevOps is all about enriching your end-user experience. Continuous feedback to improve your applications is an important discipline: the faster you respond to (external) changes, the better. Disruptions in production need to be solved as quickly as possible to keep your end-users happy, which makes the Mean Time To Recover one of the most important KPIs to put into practice. Feedback about the state of your production environment is vital, and playbooks help to close the feedback loop back to the application owner. Traditional playbooks are lengthy (manual) procedures. Playbooks as Code close your DevOps loop by automating the steps to recover from a serious disruption.
Respond to an incident
Suppose an incident happens in production. What does the typical response in an organization look like? First of all, hopefully, you will notice the problem yourself before your end-user does. Whatever person or system detects the issue calls out for someone to fix it. Most likely, this is someone who is on-call.
This person investigates the issue and starts to fix the problem. If there are no (written or consistent) procedures that describe how to handle situations like this, the person carries out whatever comes to mind to solve the problem. Whether the application is restored successfully and in time depends on his/her (domain) knowledge and experience with the (technical) issue. The follow-up of the incident varies from organization to organization and from person to person. Sometimes, the issue is reported and a proper root cause analysis is carried out, which is also documented in a (standardized) way. This helps to solve similar issues in the future. More often, things go ad hoc and identical issues might occur again.
Valuable time is lost on the end-user side and on the organization side. Lose-lose in this case instead of win-win.
One way to improve the above-mentioned situation is to create standardized procedures. Playbooks describe the procedures so that every person who works on the incident follows the same approach. Sometimes steps are automated, but the remaining manual steps can be interpreted differently by different individuals, so consistency is still not guaranteed.
Traditional playbooks do not encourage different departments/teams to work together on incidents. It’s a single person who handles the document and/or the procedures. Such a playbook helps to recover from incidents but also slows you down: updates impact multiple departments and teams, and approvals require management attention. It’s difficult to capture domain-related knowledge, so the execution of these playbooks often depends on domain experts as well.
These types of problems increase the Mean Time To Recover, which is undesirable in a modern DevOps world.
Playbooks as Code
Incident response workflows can be automated (as much as possible) using playbooks as code. Big benefits of using this approach are:
- The entire incident response workflow is executed in a repeatable and consistent way for every (similar) incident. There are no deviations between the different persons who execute the playbook.
- Playbooks written as code are much more convenient to adapt over time, and a culture of collaboration grows around them. Version control makes changes traceable and more fun to work on. No one likes to create thick documents full of procedures, and updating them is a nightmare: they are almost always out of sync with reality. Collective ownership avoids the “blame game” when a severe interruption overwhelms your organization.
- The quality of software applications increases since real-world feedback helps to propagate improvements. This feedback is available to Site Reliability Engineers and developers alike, so everyone views the same bits and pieces.
- Manual processes become a thing of the past. This yields a higher return on investment and shortens the Mean Time To Recover. Developer teams can focus on business-critical applications and more rewarding work. It also helps to keep your developers happy, which makes them less likely to seek a new adventure elsewhere.
Software tools and scripts help to put these advantages into practice.
Organizations can build their own playbooks from scratch or use existing playbooks which are already available on the internet. Some criteria which are relevant when selecting playbooks are:
- Every playbook should have a clear name
- The playbook should indicate its owner
- A list of steps to execute when the playbook runs, each with:
  - A clear description per step
  - The hosts/environments to which it applies
  - The actual step/action to execute
- All steps should be in a logical order
Variables and secrets should not be stored inside the playbook itself but stored externally. This is especially true for secrets such as certificates to log in to a certain host as well as DB usernames/passwords etc.
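To make these criteria a bit more tangible, here is a minimal Python sketch of how a playbook could be modelled and checked. It is not taken from any of the playbook tools mentioned in this article. The class names `Playbook` and `Step`, the functions `validate` and `resolve_secret`, and the example values are all assumptions made for illustration.

```python
import os
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    description: str   # a clear description per step
    hosts: List[str]   # the hosts/environments to which the step applies
    action: str        # the actual step/action to execute


@dataclass
class Playbook:
    name: str
    owner: str
    steps: List[Step] = field(default_factory=list)  # kept in execution order


def validate(playbook: Playbook) -> List[str]:
    """Return a list of problems; an empty list means all criteria are met."""
    problems = []
    if not playbook.name.strip():
        problems.append("playbook has no name")
    if not playbook.owner.strip():
        problems.append("playbook has no owner")
    if not playbook.steps:
        problems.append("playbook has no steps")
    for number, step in enumerate(playbook.steps, start=1):
        if not step.description.strip():
            problems.append(f"step {number} has no description")
        if not step.hosts:
            problems.append(f"step {number} has no hosts")
        if not step.action.strip():
            problems.append(f"step {number} has no action")
    return problems


def resolve_secret(env_name: str) -> str:
    """Read a secret from the environment instead of from the playbook."""
    value = os.environ.get(env_name)
    if value is None:
        raise KeyError(f"secret {env_name} is not set in the environment")
    return value


# Hypothetical example playbook; the host and command are made up.
restart_web = Playbook(
    name="restart-web-frontend",
    owner="team-web",
    steps=[Step("Restart the web service", ["web-01"], "systemctl restart nginx")],
)
```

Calling `validate(restart_web)` returns an empty list, while a playbook without a name or steps gets one message per missing item. The playbook only ever holds the name of a secret; `resolve_secret` looks up the value at run time. An environment variable is just one option here, a secrets manager would serve the same purpose.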
If you want to get started, take a look at the playbooks for Azure, the playbooks from Ansible or the playbook examples from Stackpulse.
Typical playbooks are written in a simple, human-readable format like YAML. This makes it easy for anyone, even with a limited IT background, to understand the scope and contents of the playbook.
Every organization needs to answer strategic questions like where to focus and which actions to prioritize.
Business considerations for playbooks (as code) include the following:
- How much do we benefit from the time spent on automation compared with how often the incident occurs? If the benefit is too small, say for rare incidents that never have a major impact, your time is not wisely spent.
- Which incidents occur most often? Think of similar incidents across multiple applications. If the impact (in terms of downtime and/or data loss) is big, this is a very good use case to start with.
- Do we build our own playbooks or re-use existing ones? On top of this: do we acquire commercial software to run the playbooks or build the CI/CD automation ourselves?
- How do we build responsible teams and get the right knowledge to handle follow-ups of the playbooks?
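The first consideration can be turned into a rough back-of-the-envelope calculation. The sketch below is only an illustration under simple assumptions: it counts engineer hours and ignores the business cost of downtime, and the function names and numbers are made up.

```python
from typing import Optional


def hours_saved_per_year(incidents_per_year: float,
                         manual_hours: float,
                         automated_hours: float) -> float:
    """Engineer hours saved per year by running the playbook as code."""
    return incidents_per_year * (manual_hours - automated_hours)


def break_even_years(build_hours: float,
                     incidents_per_year: float,
                     manual_hours: float,
                     automated_hours: float) -> Optional[float]:
    """Years until the time spent on automation is earned back.

    Returns None when automation saves no time at all.
    """
    saved = hours_saved_per_year(incidents_per_year, manual_hours, automated_hours)
    if saved <= 0:
        return None
    return build_hours / saved


# A frequent incident: 20 times a year, 2.5 hours by hand, 0.5 hours automated.
print(break_even_years(40, 20, 2.5, 0.5))   # -> 1.0
# A rare, minor incident: once a year, 1 hour by hand, 0.5 hours automated.
print(break_even_years(40, 1, 1.0, 0.5))    # -> 80.0
```

With the same 40 hours of build effort, the frequent incident pays back within a year, while the rare one would take 80 years. A high-impact incident can still be worth automating even if it is rare, which is why the impact in terms of downtime and/or data loss needs to be weighed separately.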
Management, IT architects and domain experts need to work closely together to assess these considerations and include them in the strategic (DevOps) plans.
Before jumping into the source code of some good example playbooks, it’s good to understand which problem they solve. Consider the procedure described on the website of powerobjects. These manual steps can be automated as much as possible to speed up recovery and to decrease the likelihood of someone making a manual mistake.
Developers in DevOps teams that put the “You build it, you run it, you own it” principle into reality face a steep learning curve if they want to deploy their application(s) to Kubernetes. Besides the regular/happy flows, they need to learn how to troubleshoot their cluster and applications if something goes wrong.
Stackpulse created a predefined playbook to extract the logs from a failed Pod (with one or multiple containers) and let the person in charge decide to re-run it or delete the problematic Pod. This way, there is no need to (manually) log in to the Kubernetes cluster, and you get a broader context of the problem since the playbook extracts the right logs. You can make decisions faster, also in collaboration with someone else. Finding root causes becomes much easier since these kinds of incidents can be logged and grouped together. Using this approach, it’s easier to identify patterns.
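The flow described above can be sketched in a few lines of Python. This is not Stackpulse's actual playbook, only an illustration of the idea: fetch the logs of all containers in the Pod, hand them to a person (or rule) that decides, and delete the Pod only if that is the decision. The function names, the `decide` callback and its "delete"/"keep" return values are assumptions; the `kubectl` subcommands and flags themselves are standard.

```python
import subprocess
from typing import Callable, List


def logs_command(pod: str, namespace: str) -> List[str]:
    """Build the kubectl call that fetches recent logs of every container."""
    return ["kubectl", "logs", pod, "--namespace", namespace,
            "--all-containers=true", "--tail=200"]


def delete_command(pod: str, namespace: str) -> List[str]:
    """Build the kubectl call that deletes the problematic Pod."""
    return ["kubectl", "delete", "pod", pod, "--namespace", namespace]


def run_kubectl(command: List[str]) -> str:
    """Execute a command and return its output; raises if the command fails."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


def triage_pod(pod: str,
               namespace: str,
               decide: Callable[[str], str],
               run: Callable[[List[str]], str] = run_kubectl) -> str:
    """Fetch the logs, let `decide` choose "delete" or "keep", then act on it."""
    logs = run(logs_command(pod, namespace))
    decision = decide(logs)
    if decision == "delete":
        run(delete_command(pod, namespace))
    return decision
```

In a real setup, `decide` could post the logs to a chat channel and wait for the person in charge to answer. Passing `run` as a parameter makes it possible to try out the flow with a fake runner before pointing it at a real cluster.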
Another example comes from the Linux world. Often system administrators need to know how well their servers are performing. In traditional data-centers, server monitoring includes checking the CPU & memory usage. The Linux diagnostics playbook does exactly that…and more. Furthermore, this playbook constantly monitors CPU and memory consumption of Virtual Machines as well as (upcoming) shortages of storage capacity.
It sends out alerts to Slack in case something goes wrong. The Slack notification can carry as many details as you want since you can easily change the playbook yourself. You can then take immediate action to resolve the issue.
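A single check of such a diagnostics playbook could look roughly like the Python sketch below. It is not the code of the Linux diagnostics playbook itself. The metric names, the limits and the function names are assumptions, and the webhook URL has to come from your own Slack workspace. The sketch relies on Linux-specific sources (`/proc/meminfo` and `os.getloadavg`) and on the fact that a Slack incoming webhook accepts a JSON body with a `text` field.

```python
import json
import os
import shutil
import urllib.request
from typing import Dict, List


def memory_used_percent(meminfo_text: str) -> float:
    """Compute memory usage from the contents of /proc/meminfo."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key] = int(fields[0])
    return 100.0 * (1 - values["MemAvailable"] / values["MemTotal"])


def collect_metrics(path: str = "/") -> Dict[str, float]:
    """Collect disk, memory and CPU load figures for this (Linux) host."""
    usage = shutil.disk_usage(path)
    load_1min = os.getloadavg()[0]
    with open("/proc/meminfo") as handle:
        meminfo_text = handle.read()
    return {
        "disk_used_percent": 100.0 * usage.used / usage.total,
        "memory_used_percent": memory_used_percent(meminfo_text),
        "load_per_cpu": load_1min / (os.cpu_count() or 1),
    }


def find_breaches(metrics: Dict[str, float], limits: Dict[str, float]) -> List[str]:
    """Return one message for every metric that exceeds its limit."""
    return [
        f"{name} is {value:.1f}, limit is {limits[name]:.1f}"
        for name, value in sorted(metrics.items())
        if name in limits and value > limits[name]
    ]


def slack_payload(host: str, breaches: List[str]) -> bytes:
    """Build the JSON body for a Slack incoming webhook."""
    text = f"Alert on {host}:\n" + "\n".join(breaches)
    return json.dumps({"text": text}).encode("utf-8")


def send_to_slack(webhook_url: str, payload: bytes) -> None:
    """POST the payload to the webhook URL of your Slack workspace."""
    request = urllib.request.Request(
        webhook_url, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request, timeout=10)
```

To monitor continuously, a scheduler such as cron would run `collect_metrics` and `find_breaches` at a fixed interval and call `send_to_slack` only when the list of breaches is not empty. Adding more details to the notification is a matter of extending `slack_payload`.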
Wrapping up / conclusion
Closing the DevOps feedback loop is essential to unlock the full power of a continuous improvement circle. With playbooks as code, you can raise the bar a level higher. Find out what it means to your organization. I hope this article inspires you to take a look at them and pushes you forward in your DevOps journey.