Editor’s Note: As of January 2022, iland is now 11:11 Systems, a managed infrastructure solutions provider at the forefront of cloud, connectivity, and security. As a legacy iland.com blog post, this article likely contains information that is no longer relevant. For the most up-to-date product information and resources, or if you have further questions, please refer to the 11:11 Systems Success Center or contact us directly.
There’s been a lot of talk around the water cooler lately about extremely large VMs, the pain of backing them up, and the agony of restoring these monster VMs should the need arise. When we say agony, we are thinking of something along the lines of a root canal without painkillers…
We all love that storage has become so cheap that everyone seems to have at least a terabyte of photos and two or three terabytes of movies these days – your 75-year-old father included. Gone are the days when server admins religiously set user storage quotas and monitored file shares for .jpg and .mp3 files. Why bother? It costs more to monitor, find, and delete the offending files than it costs to simply add more space.
Or does it?
Running monster VMs in your environment can be cool to brag about, but they are certainly not without peril. We recently embarked on an internal testing project to determine how big is too big from a recovery point of view using traditional backup methods. For the testing use case, we created a 21TB VM with 14 attached VMDKs stuffed full of data.
First, some technical and environmental information for the test setup: the test environment is VMware ESXi 5.5 on Cisco UCS. The backup server is a Cisco B200-M3 blade with dual Intel Xeon E5-2630 CPUs at 2.3GHz (24 cores) and 128GB RAM, running Windows Server 2012 R2 and Veeam version 9, with two additional physical proxy servers to assist with the backup/restore processes. The storage repository is a Cisco UCS C3160 with 400TB of space on 60 × 8TB disks in RAID60, connected via 2 × 10Gbps fabric connections. The restore will be made to a 25TB volume running on a Compellent SC8000 array with 576 × 800GB 10K disks.
Okay, now that we’ve got that out of the way, on to what we are here for: what does it take to successfully recover a 21TB VM?
Creating a full backup took one and a half hours, roughly what we expected. Once it had completed, we immediately set about creating the restore job. In the restore wizard, all of the VM’s data and configuration were checked by Veeam, which took longer than we expected. Once that little delay was over, we moved along through the rest of the wizard, and with great excitement, we clicked START! And nothing happened. For a long time. The nerd excitement waned, and we went home.
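For a rough sense of scale, here is a back-of-envelope sketch of the effective throughput those numbers imply, assuming the full 21TB was read during that hour and a half (Veeam compresses and deduplicates at the proxies, so the rate actually written to the repository was likely lower):

```python
# Back-of-envelope: effective throughput of a 21TB full backup in 1.5 hours.
TB = 10**12  # decimal terabyte, in bytes

data_bytes = 21 * TB          # size of the test VM's data
duration_s = 1.5 * 3600       # one and a half hours, in seconds

throughput_gbytes = data_bytes / duration_s / 10**9   # GB/s of source data
throughput_gbits = throughput_gbytes * 8              # same rate in Gbit/s

print(f"~{throughput_gbytes:.1f} GB/s (~{throughput_gbits:.0f} Gbit/s)")
```

Roughly 4 GB/s of source data per second is a healthy sustained rate, which is why the backup itself was the least painful part of this exercise.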
After some time, Veeam still showed the job at 0%, but in vSphere we could see disks being created, so things were definitely moving along. Sadly, later we were still at 0% and still creating disks in vSphere. No actual data had been copied yet because, by default, Veeam running restores in SAN mode first creates all the VMDKs and formats them as Thick Provision Lazy Zeroed. Only once disk creation has completed does the actual restoration of data begin. Note that it is possible to change the disk restore method to Thin Provision or Thick Provision Eager Zeroed by creating a new registry entry on the Veeam server, which makes this first phase go much more quickly.
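For reference, Veeam tweaks of this kind are plain values added under the product’s registry hive on the backup server. The exact value name and data depend on your Veeam version and are not given here — the name below is a placeholder, not the documented key, so confirm the correct one with Veeam support before changing anything:

```shell
:: Hypothetical example only -- "RestoreDiskType" is a placeholder name,
:: NOT Veeam's documented value; get the real name/data from Veeam support.
:: Run on the Veeam backup server, then restart the Veeam services.
reg add "HKLM\SOFTWARE\Veeam\Veeam Backup and Replication" ^
    /v RestoreDiskType /t REG_DWORD /d 1 /f
```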
One could argue that you would not need to do a full restore if you had an OS corruption issue. You could just reload the OS on a new partition and reattach the existing VMDKs, or simply use the file-level restore capability to restore the files you need. But with the rise of ransomware and large-scale data corruption, you may be left with no choice but to restore from an older backup.
Perhaps the main lesson to be learned from these tests is this: At some point, a VM really can be too big—regardless of its disk configuration—IF you ever need to restore it. When that day comes, you and the other business owners need to be aware that the restoration is definitely not going to be a fast process. Based on our experience, it’s likely that telling someone that they may have to wait more than an hour is going to be a hard sell to anyone needing access to the data.
So what is the solution? Consider migrating the data from your single giant VM to several smaller VMs or leveraging snapshots with backup software integration. This has the added benefit of distributed VMs running their workloads on separate hosts in the cluster and, most likely, improved performance. Additionally, it solves the issue of a single point of failure should your single 21TB file server crash. And finally, if one 5TB server needed to be restored, it would likely take orders of magnitude less time to restore than a single 21TB VM would take. More importantly, your entire environment would not be offline, so few people would be lining up at your desk or burning up your phone asking for a status update.
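To make the one-big-VM-versus-several-small-VMs argument concrete, here is a rough model. It assumes a constant restore rate (a placeholder figure, not one we measured) and that the smaller VMs can be restored in parallel on separate hosts — both simplifications, and in practice very large restores often scale worse than linearly, which only strengthens the case for splitting:

```python
# Rough restore-time comparison: one 21TB VM vs. four smaller VMs.
# Assumes a constant effective restore rate (illustrative placeholder)
# and ignores the lazy-zeroed disk-creation phase discussed above.
RESTORE_RATE_TB_PER_HOUR = 1.0   # placeholder; substitute your own measured rate

def restore_hours(size_tb: float, rate: float = RESTORE_RATE_TB_PER_HOUR) -> float:
    """Hours to restore a VM of the given size at a constant rate."""
    return size_tb / rate

monolith = restore_hours(21)                          # everything waits on this one job
split = max(restore_hours(s) for s in (5, 5, 5, 6))   # parallel restores: slowest one wins
single_small = restore_hours(5)                       # common case: only one VM is damaged

print(f"21TB monolith: {monolith:.0f}h; split (parallel): {split:.0f}h; "
      f"one 5TB VM: {single_small:.0f}h")
```

The real win in the common case is the last line: when only one of the smaller file servers needs restoring, the rest of your workloads stay online while a 5TB job runs instead of a 21TB one.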