March 24, 2010 12:00 AM

4 Failover Clustering Hassles and How to Avoid Them

Windows IT Pro
InstantDoc ID #103534

Why can't I have a cluster over more than one location without expensive network solutions?

Cluster-enabled services typically have a number of resources allocated to them, including an IP address. Within a single location, you can have multiple nodes connected to the same network segment, or at least to network segments in the same IP subnet. This means the IP address for the service can be hosted on any node in the cluster, because all the nodes have the same network connectivity. Now imagine you want to spread a cluster's nodes across multiple locations. Multiple locations typically mean different network segments and IP subnets. This is a problem because a cluster resource IP address of 192.168.1.10 can't be hosted in a location whose subnet is 192.168.10.0; the routing just wouldn't work. The solution to this problem has been to stretch subnets across multiple locations, which typically involves very expensive network implementations and has kept all but the largest companies from using clustering in multi-site scenarios.
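The subnet mismatch is easy to check for yourself. The following is a minimal Python sketch, not anything the cluster service runs; the function name `can_host` is hypothetical, and the /24 prefix length is an assumption, because the example addresses above don't state a mask.

```python
import ipaddress

def can_host(address: str, subnet: str) -> bool:
    """Return True if `address` falls inside `subnet` (CIDR notation)."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(subnet)

# Assumed /24 masks; the article gives only the network numbers.
print(can_host("192.168.1.10", "192.168.1.0/24"))   # True
print(can_host("192.168.1.10", "192.168.10.0/24"))  # False
```

The second call is the multi-site case: the service address doesn't belong to the remote site's subnet, so a node there can't usefully bring it online.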

Server 2008 introduced a key change that brought multi-site clustering to everyone, and it can be summed up in one word: OR. Before Server 2008, you could allocate multiple IP addresses to a service or application as part of the resource group, but the addresses were in an AND relationship: every one of them had to be functional on every node in the cluster. With Server 2008, you can allocate multiple IP addresses to a service or application and specify an OR relationship between them. The OR lets you allocate one IP address for each IP subnet the service may run on across your locations. The IP address that matches the location where the service is currently active is used for client connectivity, which means you can now have multi-site clustering without the expensive network solutions.
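To make the difference between the two relationships concrete, here is a small Python sketch. It only models the dependency logic described above; the function names, the second address (192.168.10.10), and the /24 masks are assumptions for illustration, and nothing here calls a real cluster API.

```python
import ipaddress

def usable_addresses(candidates, node_subnet):
    """Return the candidate addresses that belong to the node's subnet."""
    net = ipaddress.ip_network(node_subnet)
    return [a for a in candidates if ipaddress.ip_address(a) in net]

def online_with_and(candidates, node_subnet):
    """Pre-2008 behavior: every address must work on the node."""
    return len(usable_addresses(candidates, node_subnet)) == len(candidates)

def online_with_or(candidates, node_subnet):
    """Server 2008 behavior: one matching address is enough."""
    return len(usable_addresses(candidates, node_subnet)) >= 1

# Hypothetical service with one address per site.
service_ips = ["192.168.1.10", "192.168.10.10"]

for site_subnet in ("192.168.1.0/24", "192.168.10.0/24"):
    print(site_subnet,
          online_with_and(service_ips, site_subnet),
          online_with_or(service_ips, site_subnet),
          usable_addresses(service_ips, site_subnet))
```

With AND, the service can't come online at either site, because one of its two addresses is always in the wrong subnet. With OR, each site brings up the address that matches its own subnet.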

Just because you can allocate multiple IP addresses in an OR relationship doesn't mean all your multi-site problems will be magically solved. When a service has a single IP address, clients always know which address to use to reach it. When a service has multiple IP addresses, the solution is more complicated. You may need to use very short Time to Live (TTL) values on the service's DNS hostname records, so clients don't cache old IP addresses, or use the option to register all provider IP addresses (the network name resource's RegisterAllProvidersIP setting), so every address is registered in DNS. More likely, you'll use some kind of middle communication tier for the clients, such as (going back to the Exchange 2010 example) the Client Access Server role.
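The effect of the TTL on clients can be shown with a toy resolver cache. This Python sketch is a simulation only: the class `ClientCache`, the helper `address_seen`, the hostname, the addresses, and the timings are all hypothetical, and time is passed in explicitly rather than read from a clock.

```python
class ClientCache:
    """Toy client-side DNS cache that honors a record's TTL."""

    def __init__(self):
        self._records = {}  # name -> (address, expiry time in seconds)

    def resolve(self, name, now, dns, ttl):
        cached = self._records.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]                 # still fresh: no DNS query
        address = dns[name]                  # expired or missing: ask DNS
        self._records[name] = (address, now + ttl)
        return address

def address_seen(ttl, check_at):
    """Address a client uses at `check_at` seconds, given a failover
    that happens shortly after its first lookup at t=0."""
    name = "mail.example.com"                # hypothetical service name
    dns = {name: "192.168.1.10"}
    cache = ClientCache()
    cache.resolve(name, now=0, dns=dns, ttl=ttl)
    dns[name] = "192.168.10.10"              # service fails over to site 2
    return cache.resolve(name, now=check_at, dns=dns, ttl=ttl)

print(address_seen(ttl=300, check_at=60))    # 192.168.1.10 (stale)
print(address_seen(ttl=30, check_at=60))     # 192.168.10.10 (current)
```

With a five-minute TTL, the client is still trying the old site's address a minute after the failover. With a 30-second TTL, it has already looked up the new one, at the cost of more frequent DNS queries.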

 

I have high availability at the virtualization and application levels. Which should I use? It's just confusing.

I get this question a lot, but there's a basic piece of guidance that will help you make the right decision. The trend today is toward virtualizing everything you can, and the major virtualization solutions offer high availability services that work in both planned and unplanned situations. In a planned situation, for example, you might want to install a patch that requires rebooting a Hyper-V server. You can use the Hyper-V Live Migration function to copy the memory and state of the running virtual machines (VMs) to another Hyper-V server and avoid any VM downtime.

Unplanned scenarios, where the virtualization server just crashes, don't give you time to copy the VMs' memory and state to other virtualization servers, so the VMs have to be restarted on another virtualization server in a crash-consistent state. The services offered by the VMs will be unavailable while the guest OS boots and the services start. So, with virtualization you have the option of high availability at the virtualization level, but with unplanned server downtime, you'll have a period of unavailability.

The alternative is to enable high availability within the guest OSs using traditional technologies, such as failover clustering, with the applications. This requires that the applications support failover clustering. If they do, application-aware high availability will generally offer far less downtime than restarting the OS (which you have to do with virtualization high availability).

Consider an Exchange mailbox server that's made highly available through the virtualization layer and one that's made highly available within the guest OS. When using virtualization high availability, you install one instance of the Exchange mailbox server role in a VM, with its configuration and virtual hard disks on shared storage. You make the VM highly available through the virtualization features (in the case of Hyper-V, failover clustering is actually used on the Hyper-V hosts). If the server hosting the VM crashes, another server will restart the VM, in exactly the same way a physical box has to reboot after a crash. Because the VM wasn't shut down cleanly, disk and database corruption is possible, so the guest may need to run integrity checks, which can be very slow. This scenario is illustrated on the left side of Figure 1.

 

Figure 1   

If you instead employ Exchange's high availability features, illustrated on the right side of Figure 1, which use failover clustering in the guest OSs, you have two instances of the Exchange mailbox server role (with Exchange 2010, you can have up to 16 in a cluster or DAG). It's critical that each instance be on a separate physical server, because hosting both instances on the same physical box adds little benefit. You should add anti-affinity rules to make sure the instances don't run on the same box. You don't need to use shared storage.
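An anti-affinity rule amounts to a simple placement check: no two members of a group may share a host. The Python sketch below illustrates that check only. The function name and the VM and host names are hypothetical, and it doesn't call Hyper-V or the cluster service (on a Windows failover cluster, the rule itself is typically expressed through the group property AntiAffinityClassNames).

```python
from collections import defaultdict

def anti_affinity_violations(placement, group):
    """Return {host: [vms]} for hosts running more than one VM from `group`.

    placement: dict mapping VM name -> physical host name
    group:     set of VM names that must stay on different hosts
    """
    by_host = defaultdict(list)
    for vm, host in placement.items():
        if vm in group:
            by_host[host].append(vm)
    return {host: sorted(vms) for host, vms in by_host.items() if len(vms) > 1}

# Hypothetical names for the two Exchange mailbox server VMs.
mailbox_vms = {"EXMBX1", "EXMBX2"}

good = {"EXMBX1": "HV01", "EXMBX2": "HV02", "WEB1": "HV01"}
bad = {"EXMBX1": "HV01", "EXMBX2": "HV01", "WEB1": "HV02"}

print(anti_affinity_violations(good, mailbox_vms))  # {}
print(anti_affinity_violations(bad, mailbox_vms))   # {'HV01': ['EXMBX1', 'EXMBX2']}
```

In the second placement, a single crash of HV01 takes out both mailbox server instances, which is exactly the situation the rule is there to prevent.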

Each instance runs the Exchange software. Logs are shipped from the active copy of the database to the passive copy and replayed there, keeping the databases synchronized. If the server that's hosting the active copy crashes, the surviving instance will see that the active Exchange mailbox server is no longer responding and take ownership of the mailbox server IP and name resources. It will try to copy any missing transaction logs, check with Hub Transport to make sure no messages have been lost, and start offering mailbox services from its own copy of the database. This approach is much faster and cleaner than high availability at the virtualization layer.
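The idea behind log shipping and replay can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Exchange implements replication: the class `DatabaseCopy`, the log tuple layout, and the sample data are all hypothetical, and the Hub Transport check is left out.

```python
class DatabaseCopy:
    """Toy database copy that replays numbered transaction logs in order."""

    def __init__(self):
        self.applied = 0   # sequence number of the last log replayed
        self.data = {}

    def replay(self, logs):
        """Apply (sequence, key, value) logs, stopping at any gap."""
        for seq, key, value in sorted(logs):
            if seq <= self.applied:
                continue               # already replayed
            if seq != self.applied + 1:
                break                  # missing log: can't go further
            self.data[key] = value
            self.applied = seq

# Logs generated by the active copy (hypothetical contents).
active_logs = [(1, "alice", "msg1"), (2, "bob", "msg2"), (3, "alice", "msg3")]

passive = DatabaseCopy()
passive.replay(active_logs[:2])        # log 3 not yet shipped at crash time
print(passive.applied)                 # 2

# On failover, copy whatever logs are still missing, then replay them.
missing = [log for log in active_logs if log[0] > passive.applied]
passive.replay(missing)
print(passive.applied, passive.data)   # 3 {'alice': 'msg3', 'bob': 'msg2'}
```

Because the passive copy is already nearly up to date, failover means replaying a handful of logs rather than booting an OS and checking a database that wasn't shut down cleanly.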

In general, if you're running an application that supports high availability, such as Exchange or SQL Server, it's better to enable high availability at the application level within the guest OSs, because that gives you the shortest failover time. If you have an application that doesn't support high availability, enabling high availability at the virtualization layer is the next best thing.


Comments
  • Donoghue
    2 years ago
    Mar 25, 2010

    Your description of the improvements in clustering is informative and helpful, unlike your description of fault tolerance. Failover clustering is not fault tolerance because it is failure-recovery technology, as you describe in your section covering downtime. Fault tolerance is failure-prevention technology, and a crash as you describe it is transparent to the application and end user. There is no failover, no downtime, no data loss. Now to your point about fault-tolerant architectures being more complex than clusters. A fault-tolerant server arrives at your data center. You unpack it and plug it in twice. That's pretty much it. Yes, the equivalent of two x86 servers is in the enclosure. They run in complete lockstep using one OS license and one application license, unlike the multiple licenses required for each cluster node. Windows and Linux applications are ready to run out of the box with no modification. So is VMware. The server monitors itself, protecting against transient errors and more serious component and CPU failures. The server itself takes faulty components out of service while the duplicate continues to operate with no performance degradation and no impact on the application. The server calls back to the service center, reports the problem and, if needed, orders its own replacement part. Replacement parts are hot-swapped by the user, again with no downtime. Ongoing management doesn't get simpler than that. Clusters are good for a lot of applications, but not for truly business- or mission-critical applications. They will never achieve uptime of 99.9999 percent as fault-tolerant servers do. I'm with Stratus Technologies and, after 30 years, we know fault tolerance.


Windows is a trademark of the Microsoft group of companies. Windows IT Pro is used by Penton Media Inc. under license from owner.