Are Availability Zones a Disaster Recovery Solution?

Editor’s Note: As of January 2022, iland is now 11:11 Systems, a managed infrastructure solutions provider at the forefront of cloud, connectivity, and security. As a legacy iland.com blog post, this article likely contains information that is no longer relevant. For the most up-to-date product information and resources, or if you have further questions, please refer to the 11:11 Systems Success Center or contact us directly.

It’s interesting to see that Microsoft Azure is previewing availability zones in two of their data center regions.

Amazon Web Services (AWS) has had this capability for quite a while, but what exactly is an availability zone, and does it provide full resiliency for applications in the cloud? Is it a good disaster recovery strategy?

For both cloud vendors, the original premise was to provide better availability and to protect against failure of the underlying platform (hypervisor, physical server, network, and perhaps storage).

For many years, VMware has provided high availability and resilience through capabilities such as DRS, HA, and vMotion. Storage vMotion came in with version 3.5.

To recap:

DRS (Distributed Resource Scheduler) provides initial VM placement (which host the VM will run on) and also moves VMs between hosts based on usage to keep the cluster balanced. Typically, quiet VMs will be moved first.
HA provides the capability to restart VMs on other hosts in a cluster when either a host fails or a VM crashes for some reason (BSOD, etc.).
vMotion provides the ability to move a running VM between hosts in a cluster with no loss of service.
Storage vMotion (svMotion) allows the underlying virtual disks (VMDKs) to be moved between datastores to achieve a more balanced throughput or allow migration of data.
VMware ESXi hosts can be put into maintenance mode, which uses vMotion to evacuate hosts prior to shutting them down. This is very useful for maintenance, upgrades, etc.
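
The HA behavior in the recap can be sketched as a toy model. This is only an illustration, not VMware's placement algorithm: the function name, the host and VM names, and the "least-loaded host" rule are all assumptions made for the example.

```python
def restart_after_host_failure(cluster, failed_host):
    """Toy HA restart: re-place VMs from a failed host on surviving hosts.

    `cluster` maps a host name to the list of VM names running on it.
    Returns a new mapping without the failed host. Every VM that moved
    was restarted, so it saw a brief outage (unlike a vMotion).
    """
    survivors = {h: list(vms) for h, vms in cluster.items() if h != failed_host}
    if not survivors:
        raise RuntimeError("no surviving hosts to restart VMs on")
    for vm in cluster.get(failed_host, []):
        # Assumed rule: pick the surviving host currently running the fewest VMs.
        target = min(survivors, key=lambda h: len(survivors[h]))
        survivors[target].append(vm)
    return survivors


# Hypothetical three-host cluster; esx1 fails.
cluster = {"esx1": ["web1", "db1"], "esx2": ["web2"], "esx3": []}
after = restart_after_host_failure(cluster, "esx1")
```

The point of the sketch is the difference from vMotion: the VMs come back on another host, but only after a restart.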

Now that we’re on the same page with capabilities, let’s dive into comparing the availability zones of AWS and Azure.

Crashes or Downtime

For the most part, AWS and Azure only provide an HA capability. In the event of a host hypervisor crash or deliberate shutdown, VMs are restarted on other hosts, which is possible because they use shared storage. Neither platform has an equivalent of vMotion, svMotion, or DRS, although both do perform an initial placement calculation. Organizations running Hyper-V on-premises have live migration, but this has not been carried through to Azure.

Maintenance

Aside from crashes, this model has a problem with planned maintenance. When hosts get updated, there is no way to move the VMs running on them without loss of service, so VMs will occasionally have the rug pulled out from under them.

With this in mind, AWS and Azure talk about a different model when designing resilient services, that of “Design for Failure”. In a nutshell, design your cloud infrastructure on the premise that parts of it will fail, and provide resiliency at the application level instead. At the least, this requires doubling up of everything, and for many deployments, this will require additional licensing as well as costs for the VMs themselves.
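
The reasoning behind doubling up can be shown with some simple arithmetic. The sketch below assumes that instances fail independently of one another (the separation that availability zones aim to provide) and uses an illustrative 99% availability for a single instance. That figure is an assumption for the example, not a number published by AWS or Azure.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(single: float, copies: int) -> float:
    """Probability that at least one of `copies` independent instances is up."""
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The service is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** copies


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


single = 0.99                              # illustrative assumption
pair = combined_availability(single, 2)    # two instances, e.g. in two zones
single_downtime = expected_downtime_hours(single)
pair_downtime = expected_downtime_hours(pair)
```

Under these assumptions, one instance is down for roughly 87.6 hours a year, while a redundant pair is down for under an hour. The price is that every instance, and often every license, is paid for twice.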

When you think about the traditional virtualized applications of the past five years or longer, many solution architects and administrators have been very pleased with the high availability features offered by VMware and therefore have typically architected around single VMs for many applications. That said, people would often build out clusters for database applications, such as SQL Server, but that was often to provide the ability to upgrade the database one node at a time.

Storage

Aside from the compute elements of high availability, another area to discuss is that of persistent storage. In the past, storage was protected using RAID techniques. As we move to the public cloud, object storage has appeared as a popular way of storing data, and this uses the availability zone topology to protect data — but only if you choose it and pay for it. To protect against individual disk failure, three copies of the data are spread across the storage subsystems.

In the AWS world, both object storage (S3) and block storage (EBS) are available. For virtual machines requiring persistent storage, Elastic Block Store (EBS) is used, and this is replicated within the availability zone to protect against failure of the underlying storage platform at no extra cost. EBS volumes are not automatically replicated to other regions.

Azure does not have an EBS equivalent and instead uses Azure Blob (Binary Large Object) storage. This is available in a number of guises depending on your use case:

Block Blob
Page Blob (used for VM storage)
Append Blob

For availability purposes, the following redundancy options are provided for Blob storage:

Locally Redundant Storage (LRS) – Three copies of the data exist in a single data center.
Zone Redundant Storage (ZRS) – Three copies of the data exist in one data center and another three copies in another data center. This can only be used for block blobs. The second copy is not available for use until Microsoft enables it for you.
Geo-Redundant Storage (GRS) – Three copies of the data exist in one data center and another three copies in a secondary region hundreds of miles away. The second copy is not available for use until Microsoft enables it for you.

In the case of Azure, having data replicated to another region does not mean that the VMs are necessarily available there (just the storage). They would need to be created from the underlying replicated storage (imported into Azure VMs).

Another important consideration is that replicating storage to another availability zone or region only really protects against storage subsystem failure. It does not protect against storage corruption, accidental deletion, or recent threats such as ransomware encrypting the files within the storage, because the damaged data is replicated just as faithfully as the good data. On its own, it is definitely not a disaster recovery solution.
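
A toy model makes the distinction concrete. The class below is a hypothetical illustration of storage that writes every change to all replicas. It is not how any particular cloud storage service is implemented, and the file name and contents are made up.

```python
class ReplicatedStore:
    """Toy replicated storage: every write lands on all healthy replicas."""

    def __init__(self, replicas=3):
        self.replicas = [{} for _ in range(replicas)]

    def write(self, key, value):
        for replica in self.replicas:
            if replica is not None:
                replica[key] = value

    def fail_replica(self, index):
        # Simulate losing the hardware behind one replica.
        self.replicas[index] = None

    def read(self, key):
        for replica in self.replicas:
            if replica is not None:
                return replica[key]
        raise RuntimeError("all replicas lost")


store = ReplicatedStore()
store.write("report.docx", "quarterly numbers")

# A storage subsystem failure: the data survives on the other replicas.
store.fail_replica(0)
after_hardware_failure = store.read("report.docx")

# Ransomware overwrites the file: the bad write is replicated everywhere.
store.write("report.docx", "ENCRYPTED-BY-RANSOMWARE")
after_ransomware = store.read("report.docx")
```

After the hardware failure the original content is still readable, but after the ransomware write every surviving replica holds the encrypted version. Nothing in the replication layer keeps an earlier copy to go back to.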

What about 11:11 Systems?

In the 11:11 Cloud, an SLA of 100% availability is offered for a single data center solution with service credits being offered if that is not obtained. For customers wanting to implement a disaster recovery solution to protect against the other aspects mentioned above, 11:11 offers cloud-to-cloud DRaaS using Zerto between any of our data centers, depending on customer requirement. As with our on-premises to cloud DRaaS solution, this only involves paying for storage and inter-DC bandwidth most of the time and paying for CPU/RAM when invoking DR for testing or real scenarios.

When considering a true DRaaS solution, it is important to understand what RPOs and RTOs are required. That is, how much data can be lost due to an outage or corruption and how quickly you need to get things running again. With the 11:11 DRaaS solution, RPOs of minutes or even seconds are easily achievable, while the RTO is how long it takes to boot up the VMs. Another benefit of the Zerto-powered DRaaS solution is that it is a continuous replication solution with a journal supporting up to 30 days. This means that you can literally wind back to just before things went wrong, which is especially useful for ransomware attacks.
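
To illustrate how a journal relates to RPO, the sketch below picks the most recent recovery point before an incident and measures how much data would be lost. It is a simplified model and not the Zerto API: the function name, the 10-second checkpoint interval, and the timestamps are all assumptions for the example.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def latest_checkpoint_before(checkpoints, incident):
    """Return the most recent checkpoint strictly before `incident`, or None.

    `checkpoints` must be sorted in ascending order.
    """
    i = bisect_left(checkpoints, incident)
    return checkpoints[i - 1] if i > 0 else None


# Hypothetical journal: one checkpoint every 10 seconds for an hour.
start = datetime(2022, 1, 10, 9, 0, 0)
checkpoints = [start + timedelta(seconds=10 * n) for n in range(360)]

# Ransomware starts encrypting files at 09:41:07.
incident = datetime(2022, 1, 10, 9, 41, 7)

recovery_point = latest_checkpoint_before(checkpoints, incident)
data_loss = incident - recovery_point  # the achieved RPO for this incident
```

In this example the recovery point is 09:41:00, so seven seconds of changes are lost. Replication without a journal only offers the latest copy, which already includes the damage.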

Another striking benefit of the 11:11 DRaaS solution is the ability to carry out self-service testing whenever required while replication carries on in the background. There is no need to call support to enable the replicated storage (as in the Azure example above). Everything is available to the authenticated/authorized user through the 11:11 Cloud Console.

From a compliance and auditing perspective, the 11:11 console gives you the ability to generate and access documentation to show what has been protected and how long the failover took. We also have an in-house compliance team that is available to answer any questions you might have.

In summary, as customers think about migrating their traditional virtualized services to the public cloud, they need to consider the topics discussed above and whether they need to rearchitect their applications to accommodate the nuances of AWS and Azure. They also have the option to use a VMware-powered public cloud provider, such as 11:11 Systems, and enjoy the same capabilities they’ve been using on-premises with the added prospect of cloud-to-cloud DRaaS for additional protection against data loss or corruption.

Author: 11:11 Systems
