SharePoint server architecture in both Microsoft Office SharePoint Server (MOSS) and Windows SharePoint Services (WSS) lets you build a robust, fault-tolerant, highly available SharePoint farm that can survive the loss of any single component. However, it isn’t obvious how to do this out of the box, and some of the available guidance doesn’t cover all availability concepts.
To complicate things further, there’s a great deal of confusion about the difference between disaster recovery and high availability. High availability generally refers to keeping an application or service running and available for use when part of the infrastructure fails, while disaster recovery refers to the process of recovering an environment that has already failed.
Because this article focuses specifically on high availability, let’s first examine SharePoint high-availability concepts, then look at some prescriptive guidance for making each component in a SharePoint farm fully redundant and highly available.
Understanding SharePoint Server Role Availability
The base architectural component in a SharePoint environment is the SharePoint farm, composed of multiple servers that work together to store content and display it for end users. Each server in the farm can hold one or more server roles that determine what job the server performs in the farm topology.
For example, the web role uses Internet Information Services (IIS) to display content for users, while the index role is responsible for indexing content so that it can be made available for search. To gain a full understanding of SharePoint high availability, let’s examine each role and how it works.
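To make the role model concrete, here is a minimal Python sketch that treats a farm as a map from server names to the roles each server holds, then reports any role held by only one server. The server names (SPWEB1, SQLNODE1, and so on) are hypothetical, and the check only counts servers per role; it doesn’t model how each role is actually made redundant (NLB, clustering, or index propagation), which the following sections cover.

```python
from collections import defaultdict

# Hypothetical farm layout: server name -> roles it holds.
# The two SQL nodes stand in for a failover cluster hosting the database role.
FARM = {
    "SPWEB1": {"web", "query"},
    "SPWEB2": {"web", "query"},
    "SPINDEX1": {"index"},
    "SQLNODE1": {"database"},
    "SQLNODE2": {"database"},
}


def servers_by_role(farm):
    """Invert the farm map so each role lists the servers that hold it."""
    by_role = defaultdict(list)
    for server, roles in farm.items():
        for role in roles:
            by_role[role].append(server)
    return {role: sorted(servers) for role, servers in by_role.items()}


def single_points_of_failure(farm):
    """Return the roles held by exactly one server, sorted by name."""
    return sorted(
        role for role, servers in servers_by_role(farm).items() if len(servers) == 1
    )


print(single_points_of_failure(FARM))  # ['index']
```

In this example layout the only role left on a single server is the index role, which matches the role-by-role discussion that follows.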
Database Role Availability
The database server role, which uses Microsoft SQL Server 2005 or 2008 to house crucial SharePoint databases, can be made highly available with traditional Microsoft Cluster Service (MSCS) failover clustering. If one cluster node fails, another node in the cluster automatically takes over the database role, typically with only a brief interruption while the SQL Server instance starts on the surviving node.
Clustering is a complex topic, but to simplify: all nodes in a cluster have direct access to a shared storage location (such as a SAN disk volume) where the databases are stored, and they communicate with each other constantly so that one can take over in the event of an outage. SQL Server 2008 running on Windows Server 2008 is highly recommended because it offers the most capable and easiest-to-configure clustering options.
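The key idea of shared-storage failover is that the surviving node takes ownership of the same database files rather than copying them. The following is a toy Python model of that idea under stated assumptions: it is not how MSCS is implemented, the `alive` flag stands in for heartbeat detection, and the node names and the WSS_Content database name are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    alive: bool = True


@dataclass
class FailoverCluster:
    """Toy active/passive cluster: one node owns the SQL Server instance at a time."""
    nodes: list
    # Stands in for the shared SAN volume that every node can reach.
    shared_storage: dict = field(default_factory=dict)
    owner: str = ""

    def __post_init__(self):
        # The first node starts as the active owner.
        self.owner = self.nodes[0].name

    def check_heartbeats(self):
        """If the owner has stopped responding, hand ownership to the first live node."""
        owner = next(n for n in self.nodes if n.name == self.owner)
        if owner.alive:
            return self.owner
        for node in self.nodes:
            if node.alive:
                self.owner = node.name
                return self.owner
        raise RuntimeError("no surviving node can host the SQL Server instance")


cluster = FailoverCluster([Node("SQLNODE1"), Node("SQLNODE2")])
cluster.shared_storage["WSS_Content"] = "site data"
cluster.nodes[0].alive = False                 # simulate losing the active node
print(cluster.check_heartbeats())              # SQLNODE2
print(cluster.shared_storage["WSS_Content"])   # site data (nothing was copied)
```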
A strong SQL Server recommendation for a SharePoint environment is to have SharePoint servers connect to a DNS CNAME record, a SQL Server alias, or both, rather than to the actual name of the SQL Server host or cluster. This gives you the flexibility to move SharePoint databases to another SQL Server instance in the event of an outage or for general housekeeping.
By connecting to an alias name (e.g., spsql.companyabc.com), admins can spare themselves Microsoft’s documented procedure for moving to a new SQL Server instance, which involves a command-line operation (stsadm -o renameserver) and a full reindex.
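The benefit of the alias comes from one extra layer of indirection: SharePoint stores only the alias, so repointing the alias moves every connection at once. This minimal Python sketch models that with a dictionary; it doesn’t touch DNS or SQL Server, and the target host names (sqlclusterA, sqlclusterB) and database names are hypothetical.

```python
# Hypothetical alias table standing in for a DNS CNAME or SQL Server client alias.
ALIASES = {"spsql.companyabc.com": "sqlclusterA.companyabc.com"}

# SharePoint's stored database connections reference only the alias.
FARM_DATABASES = {
    "SharePoint_Config": "spsql.companyabc.com",
    "WSS_Content": "spsql.companyabc.com",
}


def resolve(name, aliases):
    """Follow alias entries until reaching a name that has no further alias."""
    seen = set()
    while name in aliases:
        if name in seen:
            raise ValueError(f"alias loop at {name}")
        seen.add(name)
        name = aliases[name]
    return name


def physical_servers(databases, aliases):
    """Show which real server each database connection currently lands on."""
    return {db: resolve(server, aliases) for db, server in databases.items()}


print(physical_servers(FARM_DATABASES, ALIASES))
# Moving to new hardware: repoint the alias and leave FARM_DATABASES untouched.
ALIASES["spsql.companyabc.com"] = "sqlclusterB.companyabc.com"
print(physical_servers(FARM_DATABASES, ALIASES))
```

Note that the SharePoint-side entries never change; only the alias target does, which is exactly the step that avoids a rename procedure.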
Web Role Availability
To make the SharePoint web role highly available, load-balance traffic across multiple web role servers by using a hardware load balancer or Windows Network Load Balancing (NLB). Load-balanced web role servers share a virtual IP address (VIP) so that, if one server fails, traffic sent to the VIP goes to an available host.
A few caveats apply when using NLB with SharePoint, however. First and foremost, be sure to enable session affinity, also known as “stickiness,” which keeps each user on a single server for the duration of their session unless that server goes down. This reduces problems caused when a client’s session is passed from one server to another.
If you use software NLB, be aware of two caveats tied to the NLB mode you configure. With multicast NLB, routers must be specially configured or packets will be dropped. Unicast NLB doesn’t require this special configuration but does require a dedicated NIC for intra-array traffic. The servers exchange heartbeat information across this dedicated NIC, which can reside on the same network as the standard NIC.
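Session affinity boils down to a deterministic mapping from client to server that changes only when the set of healthy servers changes. The following minimal Python sketch shows that behavior by hashing the client IP address; it is an illustration only, NLB and hardware load balancers use their own distribution algorithms, and the server names and IP address are hypothetical.

```python
import hashlib


def pick_server(client_ip, servers, healthy):
    """Send a client to the same healthy web server on every request.

    The choice is keyed on the client IP, so it only changes when the
    set of healthy servers changes.
    """
    candidates = [s for s in servers if s in healthy]
    if not candidates:
        raise RuntimeError("no healthy web front ends behind the VIP")
    digest = hashlib.sha256(client_ip.encode("ascii")).digest()
    return candidates[int.from_bytes(digest[:8], "big") % len(candidates)]


web_servers = ["SPWEB1", "SPWEB2"]
first = pick_server("10.0.0.25", web_servers, set(web_servers))
again = pick_server("10.0.0.25", web_servers, set(web_servers))
print(first == again)  # True: the client sticks to one server

# If that server fails, the same client is moved to a surviving host.
survivor = pick_server("10.0.0.25", web_servers, set(web_servers) - {first})
print(survivor != first)  # True
```

One design caveat: a plain modulo hash can also move clients whose server did not fail when the healthy set changes; consistent hashing is the usual way to limit that.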
Query Role Availability
The query role provides search results that are pulled from the full-text index used by SharePoint Enterprise Search. A farm can contain multiple query role servers, and the web role servers send search requests to them directly.
This means that query role servers don’t need a technology such as NLB to be made redundant; simply having more than one query role server makes search functionality highly available. One caveat, however, is that the query role can’t be made highly available if it resides on the same SharePoint server as the index role.
In other words, if you place the two roles on the same server, SharePoint will no longer propagate a copy of the index to any other location, even if you configure another system as a query server. The only effective way to make Search highly available is to deploy a dedicated index server and then add the query role to at least two other servers, so that the index is propagated to them and remains available in the event of an outage.
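The placement rule above can be written as a small check. This Python sketch encodes only the rules stated in this section (co-located index and query roles stop propagation; redundancy needs at least two query servers holding a copy); the server names are hypothetical and nothing here queries a real farm.

```python
def propagation_targets(index_server, query_servers):
    """Return the servers that receive a propagated copy of the index.

    Encodes the rule described above: if the index role shares a server
    with the query role, the index is not propagated anywhere else.
    """
    if index_server in query_servers:
        return []
    return sorted(set(query_servers))


def search_is_highly_available(index_server, query_servers):
    """Search survives losing one query server only if two or more hold a copy."""
    return len(propagation_targets(index_server, query_servers)) >= 2


print(search_is_highly_available("SPINDEX1", ["SPWEB1", "SPWEB2"]))    # True
print(search_is_highly_available("SPINDEX1", ["SPINDEX1", "SPWEB1"]))  # False
print(search_is_highly_available("SPINDEX1", ["SPWEB1"]))              # False
```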
Index Role Availability
The index role is the only SharePoint role that can’t be made highly available, but since the loss of index functionality isn’t immediately noticeable, this might not be an issue. If the index server is down, Search will still work as long as there are available query servers in the farm.
The only noticeable effect would be that new items added to SharePoint or other content sources wouldn’t show up in search results until the index server was rebuilt or recovered and indexing resumed.
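The effect of losing the index server can be illustrated with a toy Python model: query servers keep answering from the last index that was propagated to them, so search keeps working but new items stay invisible until indexing resumes. The class, its methods, and the sample item titles are hypothetical and only mirror the behavior described above.

```python
class SearchFarm:
    """Toy model: query servers answer from the last index that was propagated."""

    def __init__(self, query_servers):
        self.query_servers = list(query_servers)
        self.content = []              # items stored in SharePoint
        self.propagated_index = set()  # what the query servers can search
        self.index_server_up = True

    def add_item(self, title):
        self.content.append(title)

    def crawl(self):
        """Index and propagate current content; does nothing while the index server is down."""
        if not self.index_server_up:
            return False
        self.propagated_index = set(self.content)
        return True

    def search(self, term):
        if not self.query_servers:
            raise RuntimeError("no query servers available")
        return sorted(t for t in self.propagated_index if term.lower() in t.lower())


farm = SearchFarm(["SPWEB1", "SPWEB2"])
farm.add_item("Budget 2008")
farm.crawl()
farm.index_server_up = False   # index server fails
farm.add_item("Budget 2009")
farm.crawl()                   # no-op: nothing new is indexed
print(farm.search("budget"))   # ['Budget 2008']
farm.index_server_up = True    # index server recovered
farm.crawl()
print(farm.search("budget"))   # ['Budget 2008', 'Budget 2009']
```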
SharePoint Central Admin Role Availability
One commonly overlooked role from an availability perspective is the SharePoint Central Admin role, which can easily be made highly available but often is not. Central Admin, which is used to administer SharePoint, is simply a SharePoint web application connected to a dedicated site collection in a dedicated SharePoint content database.
You can make it highly available in the same way that you would make any other web application redundant in a SharePoint environment. Unfortunately, Microsoft doesn’t make this obvious, but the high-level steps for making Central Admin redundant are as follows:
1. Turn on the SharePoint Central Admin role for a second server in the farm, typically a second load-balanced web role server.
2. Change the registry setting on each SharePoint server that defines which address to use for Central Admin: in this example, the load-balanced fully qualified domain name (FQDN) http://spca.companyabc.com:8888. This also changes the default address that the local SharePoint server opens when you click the link to start Central Admin.
The registry setting for this example is as follows: HKLM\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\WSS\CentralAdministrationURL (REG_SZ) = http://spca.companyabc.com:8888
3. Change your default Alternate Access Mapping (AAM) for the SharePoint Central Admin web application to http://spca.companyabc.com:8888.
4. Add a DNS “A” record that points spca.companyabc.com to a load-balanced IP that corresponds to both SharePoint servers (either hardware- or software-based NLB will work).
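After completing these steps, a quick sanity check on each server is to confirm that the CentralAdministrationURL value from step 2 points at the load-balanced name. The following is a minimal Python sketch under these assumptions: it reads the value with the standard `winreg` module (Windows only, 64-bit registry view) from the path given in step 2, compares scheme, host, and port using `urllib.parse`, and uses this article’s example URL as `EXPECTED`, which you would replace with your own.

```python
import sys
from urllib.parse import urlsplit

# Registry location and value name from step 2; EXPECTED is this article's example URL.
REG_PATH = r"SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\WSS"
VALUE_NAME = "CentralAdministrationURL"
EXPECTED = "http://spca.companyabc.com:8888"


def read_central_admin_url():
    """Read the configured URL on the local server; returns None when not on Windows."""
    if sys.platform != "win32":
        return None
    import winreg  # standard library, Windows only
    access = winreg.KEY_READ | winreg.KEY_WOW64_64KEY  # read the 64-bit registry view
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, REG_PATH, 0, access) as key:
        value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
    return value


def points_at_load_balanced_name(url, expected=EXPECTED):
    """True if url has the same scheme, host (case-insensitive), and port as expected."""
    got, want = urlsplit(url), urlsplit(expected)
    return (got.scheme, got.hostname, got.port) == (want.scheme, want.hostname, want.port)


if __name__ == "__main__":
    current = read_central_admin_url()
    if current is None:
        print("Not running on Windows; registry check skipped.")
    elif points_at_load_balanced_name(current):
        print(f"{current}: uses the load-balanced FQDN")
    else:
        print(f"{current}: does not use the load-balanced FQDN")
```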
Note that in addition to load-balancing Central Admin, you can also enable SSL encryption and Kerberos authentication, and assign a standard port (443) for the HTTPS traffic. Microsoft not only supports these configuration changes but also recommends them for security and availability.