In the beginning, there was the computer room and things were easy. Nobody really considered things like Disaster Recovery, but you had a local backup server that used agents to back up your servers to a tape backup system. As things evolved, the tape backup morphed into a set of tape machines clustered into a single entity that captured your entire environment.
A Disaster Recovery History lesson
Doing a ‘Disaster Recovery’ meant a trip to a location where you would restore your tapes to a set of hired machines. There you would find out that the hired machines had firmware or hardware that was incompatible with your production environment, or that the tracking on the tape device that wrote the original data was off and the DR company’s tape machine could not read the recovery tapes. The result was a failed DR attempt which, at the time, was acceptable to the auditors: corporate policy stated that the company must undertake a DR test every 12 months, and nothing in it said the test had to be successful.
As capacity increased, new problems needed to be solved. Tape became too slow for most companies to endure if disaster struck, and the industry moved to virtual tape libraries: disk-based storage arrays that pretended to be a physical tape library. While this did shrink backup and recovery windows, it didn’t solve the hardware incompatibility issues.
Virtualization made DR testing successful
Disaster Recovery stability improved rapidly with the introduction of virtualization technology. The isolation from hardware and the encapsulation of the operating system into a small set of standard files did away with most of these compatibility issues.
All that was now required for a DR test or actual recovery was a couple of machines running the relevant hypervisor and the ability to remotely access the environment. Suddenly DR tests had to be successful, not just undertaken, and the false comfort of the audit tick box was removed.
As a result, we focused on concepts like recovery time objective (RTO) and recovery point objective (RPO): respectively, the time it takes to recover a service and the amount of data you are willing to accept as a viable loss.
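To make the two objectives concrete, here is a minimal Python sketch. The function names, timestamps and targets are illustrative assumptions, not part of any tool: RPO is checked against the age of the last good backup at the moment of failure, and RTO against the time taken to bring the service back.

```python
from datetime import datetime, timedelta


def meets_rpo(last_backup: datetime, failure_time: datetime, rpo: timedelta) -> bool:
    """True if the data written since the last good backup fits within the RPO."""
    return failure_time - last_backup <= rpo


def meets_rto(failure_time: datetime, service_restored: datetime, rto: timedelta) -> bool:
    """True if the service was back within the RTO."""
    return service_restored - failure_time <= rto


# Hypothetical incident timeline.
failure = datetime(2024, 1, 1, 12, 0)
last_backup = datetime(2024, 1, 1, 9, 30)   # 2.5 hours of data at risk
restored = datetime(2024, 1, 1, 15, 0)      # 3 hours of downtime

print(meets_rpo(last_backup, failure, rpo=timedelta(hours=4)))
print(meets_rto(failure, restored, rto=timedelta(hours=2)))
```

With these example values the RPO of four hours is met, while the RTO of two hours is missed by an hour.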
With the move to the cloud as a delivery platform, you would imagine that things would become even easier, moving away from the one-size-fits-all, rather inefficient way of restoring the entire workload.
Infrastructure-as-code can deploy your environment to a new region or availability zone with little or no modification to policies or procedures.
Disaster Recovery in a Cloud world
Now things start to get a little murky here; what exactly is Disaster Recovery in the cloud?
For many datacenter-oriented vendors such as Veeam, Zerto and CommVault, it means DR-as-a-Service: a cloud-like service to which you copy your VMs, storing the copies in the vendor’s vault with the option to restore your on-prem VMs to a third-party datacenter, such as a service provider’s. Whilst this is a valid paradigm and an excellent lift-and-shift migration strategy, we’re not going to dive into this kind of Disaster Recovery.
Instead, we will look at the concepts and considerations required for DR when you’re fully ‘in the cloud’.
- What would happen to your business if the region of your chosen cloud provider where your environment is set up went down?
- What would happen if the entire cloud provider went dark?
- What if the provider suffered a fatal corruption in its storage service?
The 3-2-1 rule applies to cloud, too
The DR rules are the same across local, hybrid and pure cloud deployments. The 3-2-1 backup strategy ensures you have at least three copies of your data, stored on at least two different media types, with at least one copy held in a remote and/or air-gapped location.
Three copies (the original data and two clones) provide protection from human error, such as accidental deletion of data. Keeping data on at least two different kinds of storage device makes it less likely that data will be lost to a hardware or software fault. In a traditional environment this would have been a second backup to disk and a third to tape, which was stored off-site. A hybrid environment would most likely keep the third copy in a cloud store.
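The rule can be written down as a simple check. The sketch below is illustrative only: the `Copy` record and the media labels are assumptions made for the example, and deciding what counts as "offsite" is left to you.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Copy:
    name: str
    media: str      # e.g. "disk", "tape", "object-store"
    offsite: bool   # remote and/or air-gapped from the primary location


def satisfies_3_2_1(copies: List[Copy]) -> bool:
    """At least 3 copies, on at least 2 media types, with at least 1 offsite."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


traditional = [
    Copy("production", "disk", offsite=False),
    Copy("backup server", "disk", offsite=False),
    Copy("tape vault", "tape", offsite=True),
]

# Three replicas inside one provider, treated here as not offsite.
cloud_only = [
    Copy("region A bucket", "object-store", offsite=False),
    Copy("region B bucket", "object-store", offsite=False),
    Copy("region C bucket", "object-store", offsite=False),
]

print(satisfies_3_2_1(traditional))
print(satisfies_3_2_1(cloud_only))
```

The traditional layout passes, while the single-provider layout fails on both the media and the offsite conditions.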
However, what would you do in a pure cloud environment? Different stores replicating to different regions? Perhaps, but architecturally you would logically fail for not having a copy of your data offsite. True, the chances of a global cloud provider having a total global outage that leaves you unable to reach your data are slim, but what about a situation where your root credentials have been stolen and changed, and you have lost administrative access to your data? Doesn’t sound so far-fetched, right? This is exactly what happened to Code Spaces after they lost access to their AWS environment.
What happens if your boutique cloud provider or MSP goes bankrupt? Or their datacenter goes up in flames? A fledgling provider, Online HR Services, went dark when the Buncefield oil depot blew up. Why that happened is best shown in a picture.
A number of relatively simple measures can prevent the sort of issues discussed above, such as using two-factor authentication on AWS accounts and making sure data is not locked into a single datacenter, cloud region or availability zone.
Responsibility lies with you, not the cloud
Herein lies the crux of the issue. Who is responsible for your data? You or your cloud provider? The cloud provider has a duty of care to look after it while it is stored there, but unless you have a DR plan with them, the vast majority have statements in their contracts limiting their liability. They are responsible for trying to keep the data online, but it’s you who is responsible for resilience and data protection.
So let’s revisit the 3-2-1 rule. This paradigm can be problematic in a true cloud environment, especially if you take it to its natural conclusion: egress charges. It is simple and cheap to get your data into a cloud provider but expensive to get it out.
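A back-of-the-envelope calculation shows how egress adds up. In the sketch below the per-GB price, the data volumes and the free allowance are all hypothetical placeholders, so substitute your provider’s published rates before drawing any conclusions.

```python
def egress_cost(data_gb: float, price_per_gb: float, free_gb: float = 0.0) -> float:
    """Cost of moving data_gb out of a provider, after any free allowance."""
    if data_gb < 0 or price_per_gb < 0 or free_gb < 0:
        raise ValueError("inputs must be non-negative")
    billable_gb = max(data_gb - free_gb, 0.0)
    return billable_gb * price_per_gb


# Hypothetical figures, NOT any provider's real pricing.
HYPOTHETICAL_PRICE_PER_GB = 0.05

full_copy = egress_cost(50_000, HYPOTHETICAL_PRICE_PER_GB)         # one-off 50 TB copy
monthly_changes = egress_cost(500, HYPOTHETICAL_PRICE_PER_GB) * 30  # 500 GB changed per day

print(f"one-off full copy:       {full_copy:,.2f}")
print(f"30 days of daily changes: {monthly_changes:,.2f}")
```

Even at a placeholder rate, the recurring cost of keeping an external copy current is worth modelling alongside the one-off cost of seeding it.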
Before Disaster Recovery – What can be done
So what exactly can be done to provide three-site resilience when you are fully in the cloud?
As we are already aware, Infrastructure as Code can deploy repeatable infrastructure to your target environment, and cloud storage is relatively cheap. For example, if you are currently in AWS, you could have your third data location in Azure. This location would obviously be protected by different credentials, and the storage location would be protected with assured deletion technology (requiring verification by a second person to authorize any deletion).
In the event of a major disaster, your invocation strategy would be to deploy your infrastructure into Azure and attach your third site. This is obviously a very high-level strategy; this article is meant to get you thinking about cloud-based disaster recovery.
Take the elephant of egress charges out of the room and this process makes eminent sense.
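The idea of requiring a second person to authorize a deletion can be modelled in a few lines. This is an in-memory illustration only: `ProtectedStore` and its methods are hypothetical names, not any cloud provider’s API, and a real deployment would rely on the provider’s own protection features.

```python
class ProtectedStore:
    """In-memory stand-in for a storage location that needs two people to delete."""

    def __init__(self) -> None:
        self._objects = {}   # key -> stored bytes
        self._pending = {}   # key -> user who requested the deletion

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def request_delete(self, key: str, requester: str) -> None:
        if key not in self._objects:
            raise KeyError(key)
        self._pending[key] = requester

    def approve_delete(self, key: str, approver: str) -> None:
        requester = self._pending.get(key)
        if requester is None:
            raise LookupError(f"no pending deletion for {key!r}")
        if approver == requester:
            raise PermissionError("a second person must approve the deletion")
        del self._objects[key]
        del self._pending[key]

    def __contains__(self, key: str) -> bool:
        return key in self._objects


store = ProtectedStore()
store.put("backup-2024-01-01", b"example payload")
store.request_delete("backup-2024-01-01", requester="alice")

try:
    store.approve_delete("backup-2024-01-01", approver="alice")
except PermissionError as exc:
    print("blocked:", exc)

still_present = "backup-2024-01-01" in store   # the self-approval changed nothing
store.approve_delete("backup-2024-01-01", approver="bob")
```

The point of the design is that a single stolen or misused credential cannot remove the third copy on its own.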
Other considerations
One major consideration for many cloud consumers is data sovereignty: the requirement that your data stays within a particular sovereign country. If you are in the USA this is not a significant issue, as each of the major cloud providers has many points of presence there. However, if you are in Europe or in any of a number of other locations, for example the Middle East, where at the time of writing AWS only has a single region in Bahrain, arranging three locations is significantly more difficult. Substituting a local site for your third site and deploying across two availability zones would also satisfy the site location paradigm.
Testing your DR: when IaC is a constituent part of your Disaster Recovery process, it is important to keep your code in sync and, when testing new builds, to test the DR build as well. This is especially true when the DR deployment targets a different cloud.
Make sure that your custom AMIs, deployment code, secrets, application binaries and all other artefacts used for deployment are safe, backed up and not stored in a single location.
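One way to keep an eye on this is a simple inventory check that flags any deployment artefact held in fewer than two locations. The inventory below is a hypothetical example; the artefact and location names are placeholders.

```python
from typing import Dict, List, Set


def single_location_artifacts(inventory: Dict[str, Set[str]]) -> List[str]:
    """Return the artefacts stored in fewer than two distinct locations."""
    return sorted(name for name, locations in inventory.items() if len(locations) < 2)


# Hypothetical inventory: artefact name -> locations that hold a copy.
inventory = {
    "custom-machine-image": {"primary-cloud", "second-cloud"},
    "deployment-code": {"git-host", "primary-cloud"},
    "secrets": {"primary-cloud"},
    "application-binaries": set(),
}

at_risk = single_location_artifacts(inventory)
print(at_risk)
```

Here `secrets` and `application-binaries` would be flagged, since losing the primary cloud would leave nothing to deploy them from.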
Summary
DR and business continuity are two sides of the same coin. The rules that you followed when you were running your own infrastructure, in locally deployed or co-located datacenters, have not really changed. What has evolved is the way you think about them and the solutions you put in place to recover your business. Hopefully this has given you some food for thought as you continue your cloud and cloud-native journey.