Failover clustering is a high-availability technology that minimizes service interruptions caused by hardware failure or planned maintenance. In many ways, it has suffered from an image problem. Failover clustering works well technically, but its perceived configuration and maintenance complexity scares off many potential users.
Failover Clustering is just too hard to set up and use.
This is the single most common complaint I hear about failover clustering. This view stems from the pre-Windows Server 2008 days of high availability, when creating a cluster was a fear-inducing procedure that required many pages of wizard input and huge amounts of configuration detail. Clustering generally required an expert, and you had to perform tasks on each node of the cluster. Once you'd actually created the cluster, maintenance was the next challenge and, once more, you probably needed a cluster specialist. And all of this assumed you could actually get hardware that was on the cluster-supported list.
Microsoft went back to the drawing board with Server 2008 and started from scratch on many user interface elements, including management and cluster creation. The company also simplified hardware requirements to make clustering more accessible. Windows Server 2003 had a number of different quorum models to cater to different scenarios, such as the File Share Witness, which was needed for clusters with no common storage. The File Share Witness was initially required for Exchange Cluster Continuous Replication. Server 2008 merged all the different quorum models into a single unified model that could run in different modes but was far simpler to understand.
The cluster creation experience in Server 2008 consists of launching the cluster creation wizard and specifying the servers that will be in the cluster, a name for the new cluster, and an IP address if DHCP isn't configured on the NICs. That's it: three dialog screens in total. The wizard analyzes the servers being added to the cluster, ascertains the availability of common storage, selects the right quorum mode based on storage and number of nodes, and configures all of the nodes in one go. There's no need to go to each node to set up the cluster. There's also a validation stage as part of cluster creation that checks your hardware and configuration. Assuming validation passes (which is likely, as long as your nodes are running the same processor architecture, version of Windows, and so on), your cluster is supported by Microsoft, with no need to check a Microsoft Hardware Compatibility List for your cluster or server hardware.
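To make the quorum decision a little more concrete, here is a minimal Python sketch of how a mode could be chosen from the node count and the available storage. The mode names follow Server 2008's quorum modes, but the selection rule, the function names (pick_quorum_mode, has_quorum), and the parameters are my own simplifications for illustration. They are not Microsoft's actual wizard logic.

```python
from enum import Enum


class QuorumMode(Enum):
    NODE_MAJORITY = "Node Majority"
    NODE_AND_DISK_MAJORITY = "Node and Disk Majority"
    NODE_AND_FILE_SHARE_MAJORITY = "Node and File Share Majority"


def pick_quorum_mode(node_count, has_shared_storage, has_file_share_witness=False):
    """Pick a quorum mode from node count and storage.

    ASSUMPTION: this is a simplified rule of thumb, not the wizard's real
    algorithm. The idea is to end up with an odd number of votes.
    """
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    if node_count % 2 == 1:
        # An odd number of nodes can form a majority without a witness.
        return QuorumMode.NODE_MAJORITY
    if has_shared_storage:
        # An even number of nodes plus a witness disk gives an odd vote count.
        return QuorumMode.NODE_AND_DISK_MAJORITY
    if has_file_share_witness:
        # No common storage: a file share can cast the tie-breaking vote.
        return QuorumMode.NODE_AND_FILE_SHARE_MAJORITY
    # Nothing is available to act as a witness, so fall back to nodes only.
    return QuorumMode.NODE_MAJORITY


def has_quorum(votes_online, total_votes):
    """A cluster keeps running only while a strict majority of votes is online."""
    return 2 * votes_online > total_votes


if __name__ == "__main__":
    for nodes, storage in [(2, True), (3, True), (4, False)]:
        print(nodes, storage, pick_quorum_mode(nodes, storage).value)
```

Under these assumptions, a two-node cluster with a witness disk has three votes in total, so it keeps quorum when any single vote is lost, while a four-node cluster with no witness loses quorum if two nodes go down together.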
Ongoing management is just as simple. Any time you need to make a change, there are wizards to guide you through the modification. If you have a problem, running the validation again often gives good insight into its cause. This information is further improved with Server 2008 R2, which also gives you full PowerShell management support for clusters.
I still experience some downtime with a failover cluster.
This is a common misunderstanding about failover clusters and can cause frustration. It comes down to the distinction between high availability, which failover clustering provides, and fault tolerance, where failover clustering can be only a part of the solution.
Failover clustering provides a framework of capabilities that services and applications can take advantage of in different ways. At the most basic level, failover clustering keeps an eye on all the nodes in the cluster. If one node becomes unavailable, clustering moves the services and dependent resources from the dead node and distributes them through the rest of the cluster, onto the remaining healthy nodes. With this basic usage of failover clustering, you'll see some downtime when the node hosting a service or application crashes. That crash has to be detected. Then the resources that node had mounted, such as LUNs, must be mounted on a new target node, and the service or application has to be restarted. All of these steps take time, so the service will be unavailable for a while. This would be common for something like a file or print service that's hosted as part of a cluster. It's also the case with services such as Exchange Server 2007 and Exchange 2003 Single Copy Cluster. The key fact is that failover clustering technology will get the service restarted and available again as quickly as possible, providing high availability, but not 100 percent availability.
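The sequence described above (detect the crash, move the resources, restart the service) can be sketched in a few lines of Python. This is only an illustration of the idea, not how the cluster service is implemented. The Node class, the least-loaded placement rule, and every timing value are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    healthy: bool = True
    services: list = field(default_factory=list)


def fail_over(nodes, failed_name):
    """Mark a node as failed and move its services to healthy nodes.

    ASSUMPTION: each service goes to the healthy node currently hosting
    the fewest services. Real placement rules are more involved.
    Returns a dict mapping each moved service to its new node name.
    """
    failed = next(n for n in nodes if n.name == failed_name)
    failed.healthy = False
    survivors = [n for n in nodes if n.healthy]
    if not survivors:
        raise RuntimeError("no healthy node left to host the services")
    moves = {}
    for service in failed.services:
        target = min(survivors, key=lambda n: len(n.services))
        target.services.append(service)
        moves[service] = target.name
    failed.services = []
    return moves


def failover_downtime(detect_s, mount_s, restart_s):
    """Downtime is the sum of the steps: detection, remounting, restart."""
    return detect_s + mount_s + restart_s


def availability(downtime_s, period_s):
    """Fraction of the period during which the service was available."""
    return 1.0 - downtime_s / period_s


if __name__ == "__main__":
    cluster = [Node("A", services=["file", "print"]), Node("B"), Node("C")]
    print(fail_over(cluster, "A"))
    # Hypothetical timings in seconds; real values depend on the workload.
    outage = failover_downtime(detect_s=10, mount_s=15, restart_s=20)
    year = 365 * 24 * 3600
    print(outage, availability(4 * outage, year))
```

With these made-up timings, each failover costs 45 seconds of downtime. Even several failovers a year leave availability very high, but never at exactly 100 percent, which is the distinction this section is drawing.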
When people talk about fault tolerance, they're talking about a configuration that can tolerate a failure with no service downtime to the end user. Fault-tolerant solutions typically require far more complex architectures than failover clustering, because they have to facilitate services running on multiple nodes at the same time. They also have to keep data synchronized between nodes in real time and provide failure detection and failover processes that reduce downtime to the point that it isn't noticeable. Failover clustering as shipped in Windows can't provide this for services and applications using only built-in functionality, because the implementation required differs with the way each application works.
Failover clustering provides the basic infrastructure that applications and services can build on to provide fault-tolerant solutions, but that's not to say that an application can't be fault tolerant without failover clustering. Many services are fault tolerant without it, such as Active Directory, and IIS farms that use Network Load Balancing.
A good example is Exchange 2010's Database Availability Groups (DAGs). DAGs use failover clustering behind the scenes for certain aspects of resource availability. They then add technology to replicate mailbox database data to multiple servers and provide client communication points in the form of Client Access Servers that present the data to the clients from the mailbox servers. If you're seeing short periods of downtime when a node fails, this probably isn't a problem. It's by design.