March 24, 2010 12:00 AM

4 Failover Clustering Hassles and How to Avoid Them

Windows IT Pro
InstantDoc ID #103534

Failover clustering is a high-availability technology that minimizes service interruptions due to hardware failure or planned maintenance. In many ways, failover clustering has suffered from an image problem. It works well technically, but its perceived configuration and maintenance complexities scare off many potential users.

 

Failover clustering is just too hard to set up and use.

This is the single most common complaint I hear about failover clustering. The view stems from the pre-Windows Server 2008 days of high availability, when creating a cluster was a fear-inducing procedure that required many pages of wizard input and huge amounts of configuration detail. Clustering generally required an expert, and you had to perform tasks on each node of the cluster. Once you'd actually created the cluster, maintenance was the next challenge and, once more, you probably needed a cluster specialist. And all of this assumed you could actually get hardware that was on the cluster-supported list.

      Microsoft went back to the drawing board with Server 2008 and started from scratch on many user interface elements, including management and cluster creation. The company also simplified hardware requirements to make clustering more accessible. Windows Server 2003 had a number of different quorum models to cater to different scenarios, such as the File Share Witness, which was needed for clusters with no common storage. The File Share Witness was initially required for Exchange Cluster Continuous Replication. Server 2008 merged all the different quorum models into a single unified model that can run in different modes but is far simpler to understand.

      The cluster creation experience in Server 2008 consists of launching the cluster creation wizard and specifying the servers that will be in the cluster, a name for the new cluster, and an IP address if DHCP isn't configured on the NICs. That's it: three dialog screens in total. The cluster creation process analyzes the servers being added to the cluster, ascertains the availability of common storage, selects the appropriate quorum mode based on storage and number of nodes, and configures all of the nodes in one go. There's no need to go to each node to set up the cluster. Also, there's a validation stage as part of the cluster creation that checks your hardware and configurations. Assuming validation passes (which is likely, as long as your nodes are running the same processor architecture, version of Windows, and so on), your cluster is supported by Microsoft, with no need to check a Microsoft Hardware Compatibility List for your cluster or server hardware.
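
      To make the idea of choosing a quorum mode "based on storage and number of nodes" more concrete, here is a minimal Python sketch. It is not the wizard's actual logic, which isn't described here. It only encodes a commonly cited guideline (odd node count: Node Majority; even node count: add a disk or file share witness) and the strict-majority rule that quorum voting relies on. The function names are hypothetical, and the assumption that a witness contributes exactly one vote is a simplification.

```python
# Illustrative sketch only: the function names and the selection rule are
# assumptions for this example, not the cluster wizard's real implementation.

NODE_MAJORITY = "Node Majority"
NODE_AND_DISK_MAJORITY = "Node and Disk Majority"
NODE_AND_FILE_SHARE_MAJORITY = "Node and File Share Majority"


def suggest_quorum_mode(node_count: int, has_shared_storage: bool) -> str:
    """Suggest a quorum mode from node count and storage availability."""
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    if node_count % 2 == 1:
        # An odd number of node votes can never split evenly.
        return NODE_MAJORITY
    if has_shared_storage:
        # Even node count: a witness disk supplies the tie-breaking vote.
        return NODE_AND_DISK_MAJORITY
    # Even node count, no common storage: use a file share as the witness.
    return NODE_AND_FILE_SHARE_MAJORITY


def total_votes(node_count: int, mode: str) -> int:
    """Each node has one vote; assume a witness adds exactly one more."""
    return node_count + (0 if mode == NODE_MAJORITY else 1)


def votes_needed(votes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return votes // 2 + 1


def tolerated_failures(votes: int) -> int:
    """How many voters can be lost while a majority still remains."""
    return votes - votes_needed(votes)


if __name__ == "__main__":
    for nodes, shared in [(3, True), (4, True), (2, False)]:
        mode = suggest_quorum_mode(nodes, shared)
        votes = total_votes(nodes, mode)
        print(nodes, "nodes:", mode, "| votes:", votes,
              "| can lose:", tolerated_failures(votes))
```

      Under these assumptions, a four-node cluster with a disk witness has five votes, so it keeps quorum after losing any two of them.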

      Ongoing management is just as simple. Any time you need to make a change, there are wizards to guide you through the modification. If you have a problem, running the validation again often gives good insight into its cause. Server 2008 R2 improves this information further and also gives you full PowerShell management support for clusters.

 

I still experience some downtime with a failover cluster.

This is a common misunderstanding about failover clusters, and it can cause frustration. It comes down to the distinction between high availability, which failover clustering provides, and fault tolerance, where failover clustering can be only a part of the solution.

      Failover clustering provides a framework of capabilities that services and applications can take advantage of in different ways. At the most basic level, failover clustering keeps an eye on all the nodes in the cluster. If one node becomes unavailable, clustering moves the services and dependent resources from the dead node and distributes them through the rest of the cluster, onto the remaining healthy nodes. With this basic usage of failover clustering, you'll see some downtime when the node hosting a service or application crashes. That crash has to be detected. Then the resources that node had mounted, such as LUNs, must be mounted on a new target node, and the service or application has to be restarted. All of these steps take time, so the service will be unavailable for a while. This would be common for something like a file or print service that's hosted as part of a cluster. It's also the case with services such as Exchange Server 2007 and Exchange 2003 Single Copy Cluster. The key fact is that failover clustering technology will get the service restarted and available again as quickly as possible, providing high availability, but not 100 percent availability.
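
      The failover sequence described above (detect the crash, remount the resources on a surviving node, restart the service) can be sketched in a few lines of Python. This is a toy model, not how the cluster service is implemented: the `Node` class, the `fail_over` and `estimated_downtime_seconds` functions, the least-loaded placement policy, and the timing values are all assumptions chosen for illustration.

```python
# Toy model of basic failover. All names, the placement policy, and the
# timings are hypothetical; they are not taken from the cluster service.
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    healthy: bool = True
    services: list = field(default_factory=list)


def fail_over(nodes, failed_name):
    """Mark a node as failed and move its services to healthy nodes.

    Returns a dict mapping each moved service to its new host.
    Assumed policy: place each service on the least-loaded survivor.
    """
    failed = next(n for n in nodes if n.name == failed_name)
    failed.healthy = False
    survivors = [n for n in nodes if n.healthy]
    if not survivors:
        raise RuntimeError("no healthy nodes left to host services")

    moved = {}
    for service in failed.services:
        target = min(survivors, key=lambda n: len(n.services))
        target.services.append(service)
        moved[service] = target.name
    failed.services = []
    return moved


def estimated_downtime_seconds(detect, remount, restart):
    """Downtime is the sum of the sequential failover steps."""
    return detect + remount + restart


if __name__ == "__main__":
    cluster = [Node("A", services=["file", "print"]), Node("B"), Node("C")]
    print(fail_over(cluster, "A"))
    # Placeholder timings in seconds, not measurements.
    print(estimated_downtime_seconds(detect=5, remount=10, restart=15))
```

      Because the steps run one after another, the service is unavailable for their combined duration. Shortening any one step reduces the outage, but none of them can be reduced to zero with this basic model.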

      When people talk about fault tolerance, they're talking about a configuration that can tolerate a failure with no service downtime to the end user. Fault-tolerant solutions typically require far more complex architectures than failover clustering, because they have to facilitate services running on multiple nodes at the same time. They also have to keep data synchronized between nodes in real time and provide failure detection and failover processes that minimize any downtime to the point that it isn't noticeable. In-box failover clustering can't do this for services and applications using Windows functionality alone, because applications work in many different ways and each would need its own implementation.

      Failover clustering provides the basic infrastructure that applications and services can build on to provide fault-tolerant solutions, but that's not to say that an application can't be fault tolerant without failover clustering. Many services are fault tolerant without it, such as Active Directory and IIS farms that use network load balancing.

      A good example is Exchange 2010's Database Availability Groups (DAGs). DAGs use failover clustering behind the scenes for certain aspects of resource availability. They then add technology to replicate mailbox database data to multiple servers and provide client communication points in the form of Client Access Servers, which present the data from the mailbox servers to the clients. If you're seeing short periods of downtime when a node fails, this probably isn't a problem; it's by design.

 

Comments
  • Donoghue
    Mar 25, 2010

    Your description of the improvements in clustering is informative and helpful, unlike your description of fault tolerance. Failover clustering is not fault tolerance because it is failure-recovery technology, as you describe in your section covering downtime. Fault tolerance is failure-prevention technology, and a crash as you describe it is transparent to the application and end user. There is no failover, no downtime, no data loss. Now to your point about fault-tolerant architectures being more complex than clusters. A fault-tolerant server arrives at your data center. You unpack it and plug it in twice. That's pretty much it. Yes, the equivalent of two x86 servers is in the enclosure. They run in complete lockstep using one OS license and one application license, unlike the multiple licenses required for each cluster node. Windows and Linux applications are ready to run out of the box with no modification. So is VMware. The server monitors itself, protecting against transient errors and more serious component and CPU failures. The server itself takes faulty components out of service while the duplicate continues to operate with no performance degradation and no impact on the application. The server calls back to the service center, reports the problem and, if needed, orders its own replacement part. Replacement parts are hot-swapped by the user, again with no downtime. Ongoing management doesn't get simpler than that. Clusters are good for a lot of applications, but not for truly business- or mission-critical applications. They will never achieve uptime of 99.9999 percent as fault-tolerant servers do. I'm with Stratus Technologies and, after 30 years, we know fault tolerance.

Windows is a trademark of the Microsoft group of companies. Windows IT Pro is used by Penton Media Inc. under license from owner.