Today, the final day of Cybersecurity Awareness Month, we are wrapping up our data security blog series on the NIST Cybersecurity Framework. Over the course of this series, my colleagues have highlighted the framework’s first four core functions: Identify, Protect, Detect, and Respond. I will close us out by addressing the framework’s fifth and final function, a topic that’s near and dear to my heart: Recover.
Despite all the effort you may put into protecting against and responding to threats, the sad truth is you can never guarantee with 100 percent certainty that your organization will be immune to attacks. This is where recovery comes in. NIST provides a great guide on this topic, and I recommend reading it in full as you go about your research and planning. Here’s their starting point:
The Cybersecurity Strategy and Implementation Plan (CSIP) defines recover as “the development and implementation of plans, processes, and procedures for recovery and full restoration, in a timely manner, of any capabilities or services that are impaired due to a cyber event.”
For the purposes of this post, it doesn’t make much sense to simply summarize their whole guide. Instead, I want to address a specific aspect of recovery, one you will need to carefully consider when developing and implementing your recovery plans, processes, and procedures: prioritizing workloads.
Let’s Talk: Workload Prioritization
I think it’s fair to say that every organization has its own unique needs and tolerances when it comes to downtime. Some businesses absolutely depend on many complex systems running at all times, while others may only need core services such as email and phone to limp along for hours or days without significant risk to the business. What’s most important is that you understand your business.
You must understand your organization’s IT systems and be able to determine all the components critical to a successful and timely recovery. It would be great to simply say, “Protect everything with a top-tiered service.” But for most companies, budgets are a major factor. More often than not, the choice of solutions for recovering your environment quickly boils down to financial decisions. This is why prioritizing your workloads is so important. So, let’s talk about a typical prioritization of these workloads, and solutions that fit those needs.
Before we delve into the tiers, there are two components of recovery I want to define: Recovery Time Objective (RTO) and Recovery Point Objective (RPO). RTO is the maximum acceptable amount of time to recover the server, infrastructure, and application after an outage. RPO is the maximum amount of data loss you can withstand, measured in time. More simply, it determines how often items need to be backed up and/or replicated.
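To make these two terms concrete, here is a minimal sketch in Python. It assumes backups run on a fixed schedule and that the worst case is a failure just before the next backup, so up to one full interval of data is lost. The function names are hypothetical and are used only for illustration.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """Simplifying assumption: a failure right before the next scheduled
    backup loses everything written since the previous one."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True if the backup schedule keeps worst-case data loss within the RPO."""
    return worst_case_data_loss(backup_interval) <= rpo


def meets_rto(estimated_recovery_time: timedelta, rto: timedelta) -> bool:
    """True if the estimated time to restore service fits within the RTO."""
    return estimated_recovery_time <= rto


# Hourly replication satisfies a two-hour RPO; nightly backups do not.
hourly_ok = meets_rpo(timedelta(hours=1), timedelta(hours=2))
nightly_ok = meets_rpo(timedelta(hours=24), timedelta(hours=2))
```

In practice the real exposure also depends on how long each backup or replication cycle takes to complete, which this sketch ignores.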
Going back to prioritization, you can find models with anywhere from three to five or more tiers of priority. But I have found that most of our customers generally fit into a four-tiered approach. Here is the general framework I like to use for those tiers.
Priority 0
These applications are absolutely critical to the business. When talking about IT infrastructure, these tend to be the core services that need to be running nearly all the time. Without these basic components no other services can run, so these are table stakes. Examples of these workloads would be networking and core applications, such as Active Directory and DNS services.
Priority 1
These applications keep the business running and should be considered mission critical. Generally, this set of servers or applications needs to be back up and running very quickly. If it is not, you and your stakeholders will be at significant risk. Typically, recovery needs to happen within four hours; in other words, an RTO of four hours. The amount of data lost from lack of replication or backups should also be no more than a couple of hours, meaning these workloads should have an RPO of less than two hours. Examples include services critical to your business, such as retail point-of-sale solutions, online banking portals for financial institutions, or health care applications such as Epic.
Priority 2
These applications and services may not be mission or business critical, but in order not to incur sustained losses they need to be up and running relatively quickly. These systems typically should be brought up within 24 to 48 hours. They can withstand a little more data loss, where the data may be up to 24 hours old. Examples of these could be billing systems, HR systems, and other items critical to your business process.
Priority 3
These applications are generally items that the business cannot do without, but are not critical to the minute-to-minute operations. They will not prevent your customers and other stakeholders from doing business with you, but your business still needs them to function properly in the long run. Normally, you can bring them online after your other systems have been restored, without the urgency that other systems may have. A typical RTO on these may be four to 14 days, and while data loss should be kept to a minimum, your organization will not be out of business if you lose a couple of days’ to a week’s worth of data. Examples of these systems include applications like training systems and LMSs, or historical document retention that is not covered under other compliance requirements.
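The four tiers can be written down as a small lookup table. The sketch below is one possible encoding in Python, not a definitive model. The RTO and RPO values for the last three tiers take the upper bounds from the descriptions above. The zero values for the first tier are my assumption, since those services are described only as running nearly all the time. The `Tier` class and `assign_tier` function are hypothetical names.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Tier:
    priority: int
    rto: timedelta  # maximum acceptable time to recover
    rpo: timedelta  # maximum acceptable data loss, measured in time


TIERS = [
    Tier(0, timedelta(0), timedelta(0)),               # assumption: always on
    Tier(1, timedelta(hours=4), timedelta(hours=2)),   # mission critical
    Tier(2, timedelta(hours=48), timedelta(hours=24)), # needed relatively quickly
    Tier(3, timedelta(days=14), timedelta(days=7)),    # needed in the long run
]


def assign_tier(max_downtime: timedelta, max_data_loss: timedelta) -> Tier:
    """Pick the least demanding tier whose RTO and RPO still fit within
    what the business can tolerate for this workload."""
    for tier in reversed(TIERS):
        if tier.rto <= max_downtime and tier.rpo <= max_data_loss:
            return tier
    return TIERS[0]  # nothing looser fits, so treat as always-on


# A workload that can be down six hours and lose three hours of data
# lands in the second tier (priority 1).
example = assign_tier(timedelta(hours=6), timedelta(hours=3))
```

Starting from the least demanding tier reflects the budget point made earlier: a workload should only be placed in a more expensive tier when its tolerances actually require it.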
Planning Your Cyber Recovery Strategy
Now that you have your systems prioritized, let’s take a look at how you can protect them. While I will focus on using a service provider for these, this can also be done by your staff — if your organization has the in-house experience and knowledge to run the necessary replication and backup systems.
For each of these priority tiers, you will want to plan to be running in a separate environment for an extended time. This means you should have either a separate data center or cloud ready to support recovery activities. The reality here is that most companies do not have the budget to run a mirrored data center or cloud in case of emergency. This is where a service provider, like 11:11 Systems, can help. We can host your recovery site data and workloads for a fraction of the cost of running a completely separate facility run by your IT teams. Regardless of your recovery strategy choice — DIY or using a provider — the priorities listed above still hold true and should be planned for.
For Priority 0 workloads, you will need some infrastructure in place that is up and running at all times. In traditional terms, I would call these the active/active sites. Services like Active Directory, DNS, and networking should always be on and at the ready. The solution here is to have your cloud running with enough live resources to support these core services, without paying for the rest of the infrastructure until you actually use it. These Disaster Recovery as a Service (DRaaS) environments are always up, have the “lights on,” and are ready at a moment’s notice to handle running production workloads. By utilizing a service provider such as 11:11 for DRaaS, you will have the ability to spin up compute when needed without having to devote a ton of internal resources, thus saving money and time. By only paying for the infrastructure when you use it, as opposed to pre-buying hardware and data center footprint that remains idle, you’ll be getting all the recovery reward without the full costs associated with it.
Priority 1 workloads rely on technologies that can recover your infrastructure in seconds or minutes. They also ensure you do not lose much data by using either asynchronous replication or very short intervals between snapshot replications. A good DRaaS solution will also have automation built in for things like re-IPing your servers and recovering a group of servers together, or even DVR-like features that let you wind back to any point in time very quickly. Technologies such as Zerto or Veeam’s DRaaS products are great solutions, for example. Another benefit of these services is easy testing. For isolated events, DRaaS provides the ability to quickly test and remediate outside of your production environment, and then bring the workload back into production quickly. Honestly, that topic alone deserves its own blog. If you would like to learn a bit more about DRaaS, including the different options we offer at 11:11, you can read more here.
Priority 2 workloads are usually served best with backup technologies and the ability to recover into your remote cloud. While traditional backup technologies do not provide built-in automation for recovery, and their RPOs may be a bit longer, the cost is generally lower. However, for these infrastructure pieces, technology like tape backup will prove to be way too slow. Instead, a typical backup solution like Veeam or Cohesity fits well here. Of course, these will be running locally at your site, so you will need a third copy located close to your DRaaS environment. In this case, we suggest something like Backup as a Service (BaaS).
The final workload priority, Priority 3, does not have time-sensitive recovery requirements. While you will likely use traditional backup software to protect these workloads, you will want to focus on the budget aspect for your offsite copies. In general, you will either use a legacy solution like offsite tape storage, or you can use something more modern like Object Storage with Immutability. I highly recommend storing these copies offsite with Object Storage because it is essentially air-gapped, inexpensive, and much easier to pull your data back from than tape.
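Putting the four priorities and their protection approaches together, a recovery runbook can start as little more than an ordered list. The following Python sketch is only an illustration. The workload names are hypothetical examples drawn from the kinds of systems mentioned earlier, and the strategy descriptions are my shorthand for the approaches described above.

```python
from typing import Dict, List, Tuple

# Shorthand for the protection approach discussed for each priority.
PROTECTION_BY_PRIORITY: Dict[int, str] = {
    0: "always-on DRaaS environment",
    1: "DRaaS with frequent replication",
    2: "backup with an offsite copy (BaaS)",
    3: "backup to immutable offsite object storage",
}


def recovery_plan(workloads: Dict[str, int]) -> List[Tuple[str, int, str]]:
    """Return (workload, priority, protection) rows in the order the
    workloads should be brought back: lowest priority number first,
    with ties broken alphabetically by name."""
    ordered = sorted(workloads.items(), key=lambda item: (item[1], item[0]))
    return [(name, prio, PROTECTION_BY_PRIORITY[prio]) for name, prio in ordered]


# Hypothetical inventory, for illustration only.
plan = recovery_plan({
    "billing": 2,
    "dns": 0,
    "point-of-sale": 1,
    "lms": 3,
    "active-directory": 0,
})
for name, prio, protection in plan:
    print(f"Priority {prio}: {name} -> {protection}")
```

Even a simple table like this makes it easier to check that every workload has both an assigned priority and a matching protection method before an incident occurs.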
Recover(ed): Final Thoughts
As you can see, there are many items to consider when devising a proper recovery plan. So many, in fact, that I’ve really only scratched the surface in this post. I look forward to writing more on this topic, and its many facets, in the future. Until then, the big takeaway should be this: When it comes to recovery, it is imperative to strike the right balance between the needs of the business and its budget. Maintaining such a balance requires proper planning upfront and the right recovery solutions. Of course, we are always happy to help jumpstart this process with you. For more information, I’d recommend checking out our recent Cybersecurity Awareness Month webinar, which, among other things, explains how 11:11 can help you create a security practice tailored to your organization’s needs. Or, as always, you can contact us directly.