January 14, 2010 12:00 AM

Disaster Recovery Plan Testing 101

Don’t let a disaster be your first test of your recovery plan
Windows IT Pro
InstantDoc ID #103400

You’ve written your disaster plan and distributed it to your staff. You’ve included all the points required for a decent plan: assigning a disaster recovery team and coordinator, creating detailed recovery procedures and instructions, and building call trees for employees and vendors. You’ve covered hot topics such as pandemic planning and long-term backup power. Now you sit back and wait for a disaster to show how ready your systems are for it, right?

Not so fast. True, you’ve done the hardest part, which is getting your plan ready. But how ready is it? One way to know is to put it through its paces and test it before it’s really needed.

But it’s not enough to test your backup tapes once in a while or cycle your generator once a year. Testing your disaster plan thoroughly involves testing all systems, people, and processes for their readiness and resiliency to help you see the gaps in your plan (and every plan has gaps). Testing also verifies that the information in your plan is correct, and it lets you improve your plan over time so that it’s a living document, not a dusty binder on a shelf.

So where do you get started? Let’s talk about the various levels of disaster recovery testing and the process to follow so that you get the most out of your tests.

Ways to Test Your Disaster Recovery Plan
Just as there are many types of disasters (natural, man-made, and the most common and most likely, hardware or software failure), there are many types and levels of disaster plan testing, ranging from simple to more involved (and typically more expensive). If you haven’t done much disaster recovery planning, you can start off with the simple methods and work your way up. Eventually you can develop a multi-year plan to do fully integrated tests of your disaster recovery plan every year.

Disaster Recovery Checklist. Develop a simple checklist and walk through it to make sure that every item is in place. This is not unlike the hurricane preparedness checklists that those of us on the Gulf Coast consult every year during hurricane season.

Your list might contain items such as “Generator working: check,” “Fuel for generator stored safely or under contract: check,” and “Backup tapes stored off site: check.” This is a simple process that you can do with minimal time and staff involvement. You should be doing this step no matter what.
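If you’d rather keep the checklist in a form you can rerun than on paper, a minimal sketch might look like the following. The `ChecklistItem` class, the `open_items` helper, and the sample entries (including the note text) are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    description: str
    done: bool
    note: str = ""


def open_items(checklist):
    """Return the items that have not been checked off yet."""
    return [item for item in checklist if not item.done]


# Hypothetical entries modeled on the examples in the text.
checklist = [
    ChecklistItem("Generator working", True),
    ChecklistItem("Fuel for generator stored safely or under contract", True),
    ChecklistItem("Backup tapes stored off site", False, "courier pickup missed"),
]

for item in open_items(checklist):
    print(f"OPEN: {item.description} ({item.note})")
```

Anything the loop prints is an item to resolve before you call the checklist complete.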

Disaster Recovery Walk-Through. Kick your disaster recovery testing regimen up a notch by involving your staff and walking through your disaster recovery plan with all key players present. Do a simple group reading of your plan, making sure everyone is aware of all elements. It also gives staff members the opportunity to ask questions and voice concerns about the plan.

As simple as it seems, many companies fail to do this basic step. Making sure everyone has read the plan in a group setting is vital to understanding and retention so everyone knows what to do when the time comes. Call trees should be walked through to make sure they make sense. Vendor lists and other information should be examined to make sure all the data in the plan is up to date.
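One check you can make before the walk-through meeting is whether the call tree actually reaches everyone. The sketch below assumes the call tree is written down as a mapping from each caller to the people that caller phones; the names, the `unreachable_staff` helper, and the sample tree are all made up for illustration.

```python
from collections import deque


def unreachable_staff(call_tree, root, staff):
    """Return staff members nobody would end up calling, starting from root."""
    reached = {root}
    queue = deque([root])
    while queue:
        caller = queue.popleft()
        for callee in call_tree.get(caller, []):
            if callee not in reached:
                reached.add(callee)
                queue.append(callee)
    return sorted(set(staff) - reached)


# Hypothetical call tree: caller -> people that caller is responsible for phoning.
call_tree = {
    "coordinator": ["alice", "bob"],
    "alice": ["carol"],
}
staff = ["coordinator", "alice", "bob", "carol", "dave"]

print("Never called:", unreachable_staff(call_tree, "coordinator", staff))
```

A name in the result means the tree has a gap, which is the kind of thing the group reading is meant to catch.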

Disaster Recovery Tabletop Test. This test extends the walk-through test, adding staged scenarios to see how the plan would work in real-world circumstances. Forcing your team to actually discuss what they would do in certain circumstances puts stress on the plan and can show the gaps.

By throwing these scenarios at your staff, you can see how your plan allows for different circumstances and unexpected situations. The mock scenarios can range from simple what-if questions to detailed simulated situations. Don’t publish the scenarios beforehand; spring them on your staff. This mimics the way a real disaster comes upon us.

You can also come up with certain “curve balls,” such as a faulty generator or a backup failure. See what happens when things don’t go according to plan. In my tabletop tests, I’ve had groups draw straws to remove selected staff members who are “incapacitated” by flu or some other pandemic to see how the plan reacts to the impacts of staffing losses.

When the person who knows everything is suddenly unavailable (as often happens in a disaster), who takes over, and does he or she have access to everything that’s needed? These are questions a tabletop test can answer about your disaster recovery plan.
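The draw-straws exercise can also be rehearsed on paper before the meeting. The following is a rough sketch, assuming you can list who has the access and knowledge for each recovery task; the roster, task names, and both helper functions are hypothetical.

```python
import random


def draw_straws(staff, count, seed=None):
    """Randomly pick `count` staff members to mark as incapacitated."""
    rng = random.Random(seed)
    return rng.sample(sorted(staff), count)


def uncovered_tasks(task_owners, unavailable):
    """Return tasks for which every listed owner is unavailable."""
    return sorted(
        task
        for task, owners in task_owners.items()
        if not (set(owners) - set(unavailable))
    )


# Hypothetical roster: task -> people with the access and knowledge to do it.
task_owners = {
    "restore file server": ["alice", "bob"],
    "start generator": ["carol"],
    "call vendors": ["alice", "dave"],
}
staff = {person for owners in task_owners.values() for person in owners}

out = draw_straws(staff, 2)
print("Incapacitated:", out)
print("Tasks with nobody left:", uncovered_tasks(task_owners, out))
```

Any task that shows up with nobody left points to a single point of failure in your staffing, which is worth raising as a scenario in the tabletop itself.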

Disaster Recovery Technical Tests. Here is where disaster recovery testing gets interesting: No more meetings in conference rooms—you’re testing real systems in real-life situations. It can run the gamut from simple backup media tests to complex, hot site operational switchovers.

This is where you find out what systems you can recover successfully, according to your written plan. Most companies do some level of technical testing but many could do a lot more. Let’s examine technical testing in depth and look at some guidelines for testing.

Technical Testing Your Disaster Recovery Plan
There are two levels of technical testing: parallel and live. Parallel testing is where you back up or restore a system that’s running parallel to your production system, so you don’t affect any regular processing. This is the safest way to test your technical systems.

However, it does require that you already have redundant servers in place or are willing to fund spare servers. And parallel testing doesn’t truly assure you that you can recover the production system.

Live testing involves actually downing the main system and attempting to recover it. This type of testing is also known as “full interruption” testing. It gives you a true measure of a system’s recoverability. However, it’s expensive in terms of downtime, and it’s risky: What if you can’t recover the production system?

Some situations won’t allow a true live test, since a failed test could be as bad as a real disaster and put lives at risk. For example, some healthcare, government, and military systems (e.g., air traffic control) can’t be live tested due to public safety or regulatory concerns.

You can do technical disaster recovery tests on many different systems, though you usually don’t test all your systems at once due to the risks and complexity. Most companies rotate their technical tests, doing one per quarter or one every six months so that they get through all technical system tests every year or two. Here are some basic types of testing to include in your technical disaster recovery testing regimen:

Backup media restoration. There are two main ways to test backup and restore. The first involves restoring random data items, such as a few files from selected file folders. This tests the integrity of your backup media. You’ve probably already done this on the job for some hapless employee who deleted something by mistake, but don’t wait for that opportunity or for a formal disaster recovery test. Schedule it along with your normal weekly or monthly log review.
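A random restore test can be partly scripted. The sketch below only picks the sample and compares the results; the restore itself still has to be done with whatever backup software you use. It assumes the restored copies land in a scratch directory with the same relative paths, and the function names are illustrative rather than part of any backup product.

```python
import hashlib
import random
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pick_sample(source_dir, count, seed=None):
    """Choose up to `count` files at random, as paths relative to source_dir."""
    root = Path(source_dir)
    files = sorted(p.relative_to(root) for p in root.rglob("*") if p.is_file())
    rng = random.Random(seed)
    return rng.sample(files, min(count, len(files)))


def verify_restore(source_dir, restored_dir, sample):
    """Return the relative paths that are missing or differ after the restore."""
    failures = []
    for rel in sample:
        original = Path(source_dir) / rel
        restored = Path(restored_dir) / rel
        if not restored.is_file() or sha256_of(restored) != sha256_of(original):
            failures.append(rel)
    return failures
```

In practice you would call `pick_sample` on the live folder, restore those files from your backup media into a scratch folder, and then run `verify_restore`. Keep in mind that a file edited since the backup ran will legitimately differ, so a mismatch is a prompt to investigate, not proof of bad media.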

The second type of testing involves actually restoring an entire server. This ensures you’re backing up everything you need and in the right manner. Sometimes a server has to be restored in a particular order (OS first, then database, then application program). Often, complicated programs such as SQL Server or Exchange Server don’t react well to being put on different hardware or OS versions than they were on originally.
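If a server has several layers that must come back in sequence, you can write the dependencies down and let a topological sort produce the restore order. This is a minimal sketch using Python’s standard `graphlib` module (Python 3.9 or later); the component names and the `restore_order` helper are assumptions for illustration.

```python
from graphlib import TopologicalSorter


def restore_order(depends_on):
    """Return components ordered so each one comes after what it depends on.

    `depends_on` maps a component to the set of components that must be
    restored before it. A circular dependency raises graphlib.CycleError.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical dependencies, following the OS -> database -> application example.
depends_on = {
    "database": {"operating system"},
    "application": {"database"},
}

print(restore_order(depends_on))
# ['operating system', 'database', 'application']
```

Recording the order this way also gives you something concrete to compare against what actually happened during the test restore.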

Restoring a server can involve two different levels of difficulty: You can restore onto an existing similar server or do a bare-metal restore, restoring totally from scratch with only your backup media to work with. Using ghosting or disk image software can make this process a little easier.

Both of these methods of restoring a server require some kind of backup program and backup servers to work with. And both of them will reveal any flaws in your backup plan and show additional complications or time that your restores might take. You should find these things out during testing, not when the building burns down with your servers in it.

Windows is a trademark of the Microsoft group of companies. Windows IT Pro is used by Penton Media Inc. under license from owner.