As technology has become more pervasive within almost every type of business, having a concrete plan to respond to problems has become more critical. In the past, “disaster recovery” was primarily a business function that focused on things that could bring a halt to the business, keep people away from their office, or make the site inaccessible. These events were typically “acts of God,” like severe weather, floods, or something similar.
Technology has changed all of that. Now, in addition to natural disasters, there are man-made events that can seriously impact the ability of the business to function. These could be accidents, like a construction crew cutting the network fiber to a site, a pipe bursting in the computer room, or even someone inadvertently shutting down a critical piece of equipment. Other man-made events are more nefarious, such as denial of service (DoS) attacks, malware, viruses, and ransomware. These types of attacks commonly come from outside, but it’s not uncommon for them to originate inside the walls of the company, from a disgruntled employee.
Ransomware in particular has been in the news frequently in recent months, from attacks on health systems to local and state governments. These events can effectively shut down a business as it tries to recover.
As a result, the concept of Business Continuity is included in the latest ITIL release. If you look at many IT job postings, you’ll see that it is often being included as a key responsibility.
Despite that, many organizations still struggle with the idea of how to build an effective business continuity plan (BCP). Hopefully, this article will help lay a basic foundation.
Building the Business Case
Despite everyone recognizing that recovery from a business-halting event is important, it may still be necessary to sell the idea of having a formal business continuation process. To that end, you may need to build a business case that quantifies the impact of these events. As with most business plans, the most effective approach is often driven by cost.
Obviously, there are levels of response, based on specific events. You can choose to focus on the worst-case scenario, meaning the complete shutdown of the business, which would involve payroll costs, real estate costs, loss of revenue, and so forth. But the worst-case scenario is often harder to picture, and while it’s possible, those types of events are few and far between.
A better approach would be to simply look at one type of event. For example, since there are plenty of examples of ransomware attacks and the costs they incur, focus on that type of scenario. Lay out the tangible costs—the ransom amount, cost of lost productivity, costs of lost business, and so forth. Then list the intangibles, such as the public relations impact or the potential effect on the stock price.
Explain how this type of attack works: it can start with an innocuous-looking email in which someone clicks a link, after which the malicious software can travel through your network and attack anyone who is online. Point out that these events get worse over time. The longer the downtime, the more costly; the longer the delay in response, the more likely that the infection will continue to spread or the cost to recover the data may go up. The point is not to frighten anyone, but rather to help develop a healthy respect for what can happen if there is no documented response.
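To make the cost argument concrete, the tangible costs and the way they grow with downtime can be roughed out in a few lines of Python. This is only a minimal sketch: every figure, field name, and the `RansomwareScenario` class itself are hypothetical placeholders, and the intangibles (public relations, stock price) are deliberately left out because they are hard to quantify.

```python
from dataclasses import dataclass


@dataclass
class RansomwareScenario:
    """Hypothetical inputs for a single-event business case."""
    ransom_demand: float             # amount demanded by the attacker
    recovery_services: float         # outside help, restores, etc. (assumed flat fee)
    affected_employees: int          # people who cannot work during the outage
    hourly_cost_per_employee: float  # loaded payroll cost per hour
    lost_revenue_per_hour: float     # business lost while systems are down

    def tangible_cost(self, downtime_hours: float) -> float:
        """Fixed costs plus the costs that scale with downtime."""
        lost_productivity = (
            self.affected_employees * self.hourly_cost_per_employee * downtime_hours
        )
        lost_business = self.lost_revenue_per_hour * downtime_hours
        return (
            self.ransom_demand
            + self.recovery_services
            + lost_productivity
            + lost_business
        )


# Placeholder numbers only; substitute your organization's own estimates.
scenario = RansomwareScenario(
    ransom_demand=50_000,
    recovery_services=25_000,
    affected_employees=200,
    hourly_cost_per_employee=40,
    lost_revenue_per_hour=5_000,
)

for hours in (8, 24, 72):
    print(f"{hours:>3} h of downtime: ${scenario.tangible_cost(hours):,.0f}")
```

Showing the same scenario at several downtime lengths tends to make the "gets worse over time" point more persuasive than a single worst-case number.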
Explain what the approach will be (as documented below) and how important having leadership sponsorship is to the success of the plan. Once you have the go-ahead, you can move on to the next step.
Building the Business Continuity Plan
The first step is building the BCP teams. These can be broken into three different sets of participants:
- BCP Committee
- BCP Response Team
- BCP Consultants
For each group, there should be a distribution list and a call tree. Keep in mind that in a continuation event, not everyone may have access to email or voicemail, so a clear communication plan is necessary. Distribution lists can take a number of forms, and using more than one gives the best coverage: email, voicemail, text message, social media, or direct calls to someone’s mobile or home phone.
This is where the call tree becomes effective. Instead of one person trying to reach out to entire teams, a tree structure is created where specific contacts call others, who in turn call their contacts, and so forth. This is a straightforward means of contacting everyone that should work across almost any sized organization; in larger, more complex environments it may be possible to automate the notification process.
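For organizations that want to automate or simply sanity-check a call tree, the structure can be sketched as a mapping from each caller to the contacts they are responsible for reaching. The example below is an illustrative assumption, not a prescribed format: the role names, the `CALL_TREE` mapping, and the `call_rounds` helper are all hypothetical.

```python
# Hypothetical call tree: each caller maps to the people they phone.
CALL_TREE = {
    "primary_decision_maker": ["site_lead_a", "site_lead_b"],
    "site_lead_a": ["app_owner_1", "app_owner_2"],
    "site_lead_b": ["vendor_liaison"],
}


def call_rounds(tree, root):
    """Walk the tree breadth-first and return who is contacted in each round.

    Round 0 is the person who activates the tree; each later round holds
    the contacts reached by the people in the previous round.
    """
    rounds = []
    seen = {root}
    current = [root]
    while current:
        rounds.append(current)
        next_round = []
        for caller in current:
            for contact in tree.get(caller, []):
                if contact not in seen:  # nobody is called twice
                    seen.add(contact)
                    next_round.append(contact)
        current = next_round
    return rounds


for number, people in enumerate(call_rounds(CALL_TREE, "primary_decision_maker")):
    print(f"Round {number}: {', '.join(people)}")
```

The number of rounds is a rough proxy for how long it takes to reach everyone, which is one reason to keep the tree shallow and to give each caller only a handful of contacts.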
While some people may belong to more than one team, the makeup of the three teams differs because they have different functions.
The BCP Committee is made up of individuals representing key functions. These should be people who have the authority to make decisions that impact their areas of responsibility, preferably managers or their delegates. Each of these individuals will own their functional areas’ specific response to a BCP event. There should be two people designated as the overall decision-makers for the BCP: a primary and an alternate. Additionally, it often makes sense to have two people assigned as document owners—a primary and an alternate—who are responsible for maintaining the actual contents of the plan.
The BCP Response Team is made up of people who are activated in the case of an event. Using the primary/alternate model, key sites would have specific representatives, key functional owners would be represented, and the overall decision maker would be in this group as well to make on-the-spot decisions as events unfold. There are really three areas that need representation: critical sites, critical applications, and critical vendors. I’ll discuss defining these areas later.
Finally, BCP Consultants are individuals who either need to be kept informed of progress, such as members of senior leadership not on the above teams, or have specific skills or knowledge that may need to be called upon depending on the type of event, its duration, or other predefined criteria. These could be other IT partners, business leaders, or even vendor representatives.
One of the jobs of the BCP Committee will be to create definitions for what is considered critical. This will vary depending on things like the size of the business or the industry.
Critical Sites. How a critical site is defined depends on what is housed at the site. In a large, distributed company not all sites may be considered critical. This could be linked to the census—the larger the site, the more critical it may be—or it could be linked to function—a small site with a critical function. For example, a data center or server farm would be considered critical even though there are fewer people housed there. For each critical site, there should be a clear response plan. If the site is critical because of census, document how the plan addresses keeping those employees working in the event the site becomes unavailable. Is moving to another site a possibility? Telecommuting? Document the response within the plan. If the site contains a critical function, the documentation should lay out the business impact of losing that function over time. The business might be able to survive a day, but beyond that, costs start to go up and the potential for loss grows. Are there workarounds or other sites that can pick up the slack? The document should identify them and lay out the process for the workaround or for moving the function. If there is only one site, determine if there’s a possibility that a vendor could provide similar services, both in the short term and over the long term.
Critical Applications. These are applications that, if lost, will have a measurable negative impact on the business. For most enterprise productivity applications, there are alternatives and workarounds. For example, there are a number of productivity tools that offer word processing, spreadsheets, and presentation software. It becomes more difficult when trying to find a workaround or alternative to an internally developed application. In this case, it would be beneficial to have an off-premises instance of the application, isolated from the normal business network, that could either be reinstalled or spun up by a designated vendor. As part of the plan, document the impact over time, establishing when it would become necessary to reinstall or utilize an outside source.
Critical Vendors. Every business has vendor partners that are important. If one of your vendors has an event that impacts availability, it’s important that there’s a documented workaround or alternative. In some cases, there’s an obvious choice. For example, there are a variety of shipping companies that could provide similar services, or there are a number of computer hardware manufacturers that could offer machines built on a common architecture. However, other partners that offer products or services that may not be readily available might be harder to replace. This emphasizes the importance of having a documented business continuity plan, as you can do that research ahead of time and determine how you might respond in the case of an event.
It shouldn’t need to be said, but it’s vitally important that critical business data is backed up, either via an on-premises application or through a reliable vendor. This data should be securely moved off-premises to a secure space on a regular basis. It should be isolated from the normal network to avoid corruption in the case of targeted attacks.
For each of these critical areas, part of the documentation should be a time-based checklist. This details what happens at each stage of an event. There should be clearly documented steps that occur when the BCP is activated.
In the case of natural disasters, sometimes there is warning—a hurricane or severe snowstorm, for example. In that case, the plan could be activated ahead of the event, at Day minus x, to prepare or harden a site, move a function, coordinate with vendors, etc.
If there is no warning, then the plan is activated, and the checklist provides direction on what needs to happen, in what order. For instance, upon activation, one of the first steps is getting everyone together via the call tree. The team can assess the severity of the event, and based on that, make further preparations.
From there, the plan documentation should make an attempt to lay out what needs to happen over time. This is where specific actions are documented if service is interrupted for a day, several days, a week, two weeks, a month, all the way to a site, application, or vendor being unrecoverable. This time-based documentation should estimate the impact to the business and attempt to detail how much functionality can be returned at each step of the plan.
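One way to picture a time-based checklist is as a list of elapsed-time thresholds, each paired with the action that becomes due once the outage has lasted that long. The sketch below is only an illustration: the thresholds, the action wording, and the `actions_due` helper are assumptions, and a real plan would keep a separate checklist for each critical site, application, and vendor.

```python
from datetime import timedelta

# Hypothetical checklist: (time since activation, action that becomes due).
CHECKLIST = [
    (timedelta(0), "Activate the call tree and assess severity"),
    (timedelta(days=1), "Move the critical function to a workaround or alternate site"),
    (timedelta(days=3), "Reinstall or spin up the off-premises application instance"),
    (timedelta(days=14), "Engage an alternative vendor for long-term service"),
]


def actions_due(elapsed):
    """Return every action whose threshold has been reached, in order."""
    return [action for threshold, action in CHECKLIST if threshold <= elapsed]


# Example: what should already be underway one week into an event?
for action in actions_due(timedelta(days=7)):
    print("-", action)
```

Keeping the checklist as plain data like this also makes the periodic review easier, since changes amount to editing rows rather than rewriting procedures.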
A key function of the BCP Committee is to review the plan and keep it up to date. At minimum, there should be a quarterly document review that identifies changes. If the plan is kept current, these quarterly meetings shouldn’t take much time, only a quick evaluation to determine if site contacts have changed or if any of the applications or vendors need to be updated.
Implementing the Plan
Finally, a best practice is to rehearse implementing the BCP regularly. This means testing the distribution lists and call trees and documenting the response. It also means holding regularly scheduled simulations where the BCP Response Team works through a scenario. This should happen at minimum once a year, and twice yearly would be better. Each practice session should present a unique scenario; some I’ve participated in were fun exercises like a Zombie Apocalypse, while others were realistic simulations of a ransomware attack. Use what’s popular or currently in the news, or look at events that could be specific to your region (hurricanes, earthquakes, etc.). This allows the team to work through the checklist and document areas for improvement.
This is a lot to take in, but the reality is this article only scratches the surface of building a business continuity plan. If done properly, a BCP can be invaluable even in the case of minor incidents and a huge benefit in the unfortunate event the full BCP has to be activated. Being prepared can be the difference between success and failure.
Mike Hanson has many years of experience with IT leadership, having managed several different aspects of technology over the past 30 years. Today, he serves as global operations manager and leads the asset management and fulfillment teams for Optum, Inc. He has been involved with HDI for many years, as both a local chapter officer and as a past chair of the HDI Desktop Support Advisory Board. Follow Mike on Twitter @Mike_MiddleMgr.