In Part 1 of this article (“SharePoint 2010 Disaster Recovery, Part 1”), I discussed the various types of disasters that can befall your Microsoft SharePoint Server installation, as well as techniques to protect against those disasters. Part 1 focused on how to plan for and recover from content deletion disasters; in that discussion, I assumed that the infrastructure was functioning. In this article, I cover what happens when the infrastructure itself fails. This type of disaster includes machine outages, such as a SharePoint server crash or a Microsoft SQL Server machine crash, as well as facility failures. I also explain how to make your SharePoint farm highly available, including measures you can take to prevent the types of disasters I discuss.
Background
Any good technical article provides some background to help frame the discussion. In Part 1, I discussed the need to have a service level agreement (SLA) in place to define your disaster recovery expectations.
You also need a well-defined recovery time objective (RTO), which is a guideline for how quickly you must get SharePoint back online after a disaster strikes. For example, your RTO might state that when SharePoint goes offline, your objective is to get it back online within 4 hours. Having a defined RTO helps you shape your disaster recovery strategy and sets your customers’ expectations. A good rule of thumb is to round up when you estimate recovery time. It’s better to underpromise and overdeliver than to overpromise and underdeliver. Sometimes the key to success hinges on lowered expectations.
Another important factor in ensuring successful disaster recovery is your recovery point objective (RPO), which defines how current the data will be when it comes back online. In Part 1, I discussed RPO in the context of the point from which documents can be restored. For example, is the RPO midnight, when the backups ran? Is the RPO “no more than 2 hours old”? The RPO specifies the most recent point in time to which you can recover.
To illustrate these points, it’s helpful to use an example. Suppose your RTO is 2 hours and your RPO is midnight of the previous night. If someone calls you at 1:00 p.m. to report missing content, you have until 3:00 p.m. (i.e., your RTO of 2 hours) to restore the data, and the restored data will be no older than midnight of the previous night (i.e., your RPO).
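The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration: the function name and the calendar date are my own assumptions, and only the times of day come from the example above.

```python
from datetime import datetime, timedelta

def recovery_targets(reported_at, rto, last_backup_at):
    """Return (restore deadline, age of the restored data at report time)."""
    restore_deadline = reported_at + rto
    data_age = reported_at - last_backup_at
    return restore_deadline, data_age

# Hypothetical date; only the times of day come from the example.
reported_at = datetime(2011, 6, 1, 13, 0)    # the 1:00 p.m. call
last_backup_at = datetime(2011, 6, 1, 0, 0)  # the midnight backup (the RPO)

deadline, data_age = recovery_targets(
    reported_at, timedelta(hours=2), last_backup_at
)

print(deadline.strftime("%H:%M"))  # time by which the restore must finish
print(data_age)                    # how far behind the restored data will be
```

With these inputs, the deadline works out to 3:00 p.m., and the restored content reflects the state of the farm 13 hours before the call.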
Having a well-defined RTO and RPO is imperative to planning an appropriate disaster recovery strategy. If you’re performing backups only at midnight and your RPO states that your customers will never lose more than 4 hours of work, then you have a conflict. To meet your RPO, you must increase your backup frequency, which will cost money at the very least and might also decrease performance. However, these tradeoffs are necessary to meet your objectives. In most cases, the shorter the RTO or RPO, the more money and management time it takes to achieve.
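One way to spot this kind of conflict is to compare the backup interval with the RPO. The sketch below is a simplification with hypothetical function names: it assumes evenly spaced backups and ignores how long each backup takes to run, so treat the result as a lower bound on backup frequency.

```python
import math
from datetime import timedelta

def meets_rpo(backup_interval, rpo):
    """Worst case, a failure just before the next backup loses a full interval."""
    return backup_interval <= rpo

def backups_per_day(rpo):
    """Fewest evenly spaced backups per day that keep the interval within the RPO."""
    return math.ceil(timedelta(days=1) / rpo)

nightly = timedelta(hours=24)  # backups only at midnight
rpo = timedelta(hours=4)       # customers never lose more than 4 hours of work

conflict = not meets_rpo(nightly, rpo)
needed = backups_per_day(rpo)

print(conflict)  # whether the current schedule violates the RPO
print(needed)    # minimum number of backups per day
```

For the midnight-only schedule and a 4-hour RPO, the check reports a conflict and a minimum of six backups per day.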
Machine Failure Outages
Now that we’ve covered the basics, let’s get down to the technical aspects of disaster recovery. The very least you can do in order to recover from a machine failure is to back up your databases. As the old saying goes, “Content is king”—and if you have all your databases, you have all your content. If you’re not already performing backups, rest assured that getting started is easy. (Remote Blob Storage—RBS—presents a database backup problem; for more information, see the sidebar “Remote Blob Storage Affects Database Backup.”)
For a crash course in performing backups of all your SharePoint databases, see my blog post “Scheduling SQL backups for SharePoint.” While you’re at it, go ahead and back up your SQL Server system databases as well. These backups won’t take up much space, but they’ll really come in handy if you have to rebuild your SQL Server instance from scratch. Not only can you use these database backups to recover individual items, as I discussed in Part 1, but you can also use them to recover site collections, web applications, service applications, or an entire farm. Let’s walk through a few scenarios to see how.
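Before the scenarios, here is a rough idea of what a nightly full backup might look like. This sketch is not the method from the blog post; it only builds T-SQL BACKUP DATABASE statements as text and doesn’t connect to SQL Server. The helper name, the backup folder, the date stamp, and the database list are all assumptions for illustration, and your farm’s database names will differ.

```python
def backup_statement(database, backup_dir, stamp):
    """Build one T-SQL full-backup statement for the named database."""
    bracketed = database.replace("]", "]]")   # escape for a [bracketed] identifier
    path = f"{backup_dir}\\{database}_{stamp}.bak"
    quoted_path = path.replace("'", "''")     # escape for an N'...' string literal
    return (
        f"BACKUP DATABASE [{bracketed}] "
        f"TO DISK = N'{quoted_path}' WITH INIT, CHECKSUM;"
    )

# Illustrative names only: two typical SharePoint database names,
# plus SQL Server's master, model, and msdb databases.
databases = ["SharePoint_Config", "WSS_Content", "master", "model", "msdb"]

# Hypothetical backup folder and date stamp.
statements = [backup_statement(db, r"D:\Backups", "20110601") for db in databases]

print("\n".join(statements))
```

You could paste the generated statements into a query window or schedule them as a job. WITH INIT overwrites any existing backup sets in the target file, so a real schedule would normally use a fresh date stamp for each run.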
SharePoint crashes. Suppose you have a typical small farm that consists of one SQL Server machine and one SharePoint server. As a good SharePoint administrator, you make backups of all your databases each night. You come in one rainy Wednesday morning to the cries of your users that “SharePoint is down!” After getting your morning coffee, you try to browse to SharePoint and realize, lo and behold, that it’s actually down. Not only is SharePoint down, it appears that the entire server is down. You can’t connect to it via RDP, and it won’t respond to pings—it’s just plain dead. You rush into the server room and you see your SharePoint server sitting at the boot screen, unable to find a hard drive to boot from. Whichever drive subsystem you had, whether a single drive, RAID 1, or RAID 5, it’s no longer working. The server and all its contents are gone. What do you do, besides verify that your resume is on a thumb drive in your pocket?