
What are 99.9% PC/LAN server up-time and availability worth to you? More to the
point, can you afford to bet your business on Windows NT?
Many companies have their LAN, databases, and all other business functions
on NT systems. But companies such as financial institutions question whether NT
is ready for prime-time, mission-critical applications. When you rely on
computers for your accounting, product development, human resources management,
data management, and now sales through the Internet, your systems must be
operational 24 hours a day, seven days a week. Failure is not an option.
Clustering, which has been around in Unix and VMS for more than 10 years,
is one technology for achieving near 99.9% server up-time. By letting you
duplicate a mission-critical system, this technology guarantees availability, so
you can bet your business on your OS.
Now clustering is coming to NT. Although this technology is not on the
grand scale of its Unix or VMS predecessors, clustering offers functionality
heretofore unknown to PC operating systems and represents a big step for NT
toward availability worthy of those major-league, mission-critical enterprise
applications. By having two computers instead of just one to support a task, you
greatly improve your chances of meeting the goal of 99.9% server up-time.
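To put rough numbers on that up-time goal, here is a minimal Python sketch of the arithmetic. It rests on two simplifying assumptions that are mine, not Digital's: the two servers fail independently of each other, and failover takes no time. The function names and the 99% single-server figure in the usage note are purely illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours in a non-leap year


def downtime_hours_per_year(availability):
    """Expected hours of downtime per year for an availability fraction."""
    return (1.0 - availability) * HOURS_PER_YEAR


def pair_availability(single):
    """Availability of a two-server pair in which either server can
    carry the service.

    Assumes independent failures and instantaneous failover, so the
    service is down only when both servers are down at the same time.
    """
    return 1.0 - (1.0 - single) ** 2


if __name__ == "__main__":
    print(downtime_hours_per_year(0.999))
    print(pair_availability(0.99))
```

At 99.9% availability, a system can be down for roughly 8.76 hours a year. Under the stated assumptions, two servers that are each available 99% of the time would together be available about 99.99% of the time. Real clusters fall short of that ideal because failover takes time and because shared components, such as the storage array, can fail too.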
Clustering 101
Before I get into the specifics of Digital's cluster solution, let me
explain some clustering terminology: load balancing, primary server, failover
(or secondary) server, failover, and failback. You can set up each server so
that all five terms apply to it.
For example, suppose you use SQL Server for your accounting and order
fulfillment departments and you have two databases that you want to protect by
implementing a cluster. In a single-cluster environment (two servers), you can
manually load balance--divide the work between the two servers--by
installing SQL on both machines. Make one the accounting database's primary
server--the system with principal ownership and management responsibility
for a resource--and the other system the ordering database's primary server.
Then, set up each system to have a primary disk (or disks) on the shared storage
array (a chassis housing shared disk drives where cluster software stores and
shunts data between systems). This disk will serve as the database device. So
far, this configuration is no different from setting up two independent servers,
except that the shared disks are on a subsystem physically connected to both
servers.
Now, you set up the cluster by configuring each machine to be the other
machine's failover (secondary) server--the system that will inherit
ownership and responsibility for a resource. So when one system
(the primary server) goes down, it will fail over--relocate cluster
services or resources from the faulty system to the operational one. Its
resources move to the failover server, and the service (such as a database)
keeps running. When the primary server comes back online, the service will fail
back--automatically migrate cluster resources from the failover server back
to the primary server.
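The accounting and ordering example above can be modeled in a few lines of Python. This is only a toy sketch of the ownership rules as described here, not Digital's implementation; the class name, method names, and server names are hypothetical.

```python
class TwoNodeCluster:
    """Toy model of resource ownership in a two-server cluster."""

    def __init__(self, primaries):
        # primaries maps each resource name to its primary server name.
        self.primaries = dict(primaries)
        self.servers = sorted(set(self.primaries.values()))
        if len(self.servers) != 2:
            raise ValueError("this sketch models exactly two servers")
        self.up = {server: True for server in self.servers}
        # Every resource starts out on its primary server.
        self.owner = dict(self.primaries)

    def _partner(self, server):
        return next(s for s in self.servers if s != server)

    def server_down(self, server):
        """Fail over: move the failed server's resources to its partner."""
        self.up[server] = False
        partner = self._partner(server)
        if self.up[partner]:
            for resource, current in self.owner.items():
                if current == server:
                    self.owner[resource] = partner

    def server_up(self, server):
        """Fail back: the returning server reclaims its primary resources.

        It also picks up any resource whose current owner is down.
        """
        self.up[server] = True
        for resource, primary in self.primaries.items():
            if primary == server or not self.up[self.owner[resource]]:
                self.owner[resource] = server


if __name__ == "__main__":
    cluster = TwoNodeCluster({"accounting": "server_a", "ordering": "server_b"})
    cluster.server_down("server_a")
    print(cluster.owner)
    cluster.server_up("server_a")
    print(cluster.owner)
```

While server_a is down, server_b owns both databases. When server_a returns, the accounting database fails back to it and the ordering database stays on server_b.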
The failover server is not just a cold standby server (as with Novell): The
server performs meaningful work and provides more than disk-mirroring or
single-system availability through hot-swappable disks. The open architecture of
both the software and the off-the-shelf hardware means that you have
scalability built in. You can add disk storage almost ad infinitum and add
functionality with more CPUs and peripherals such as printers and tape drives.
Digital's Configuration
Digital Clusters for Windows NT consists of two servers, a network connection,
cluster software, and an external disk array with SCSI adapters. (Although the
1.0 product release supports hardware-based RAID, the Digital BA356 storage
subsystem doesn't. A future product release will have a built-in RAID 5
controller. Also, version 1.0 does not support software-based RAID
through NT.) A key feature of this clustering solution (Digital will contribute
this feature to Microsoft's Wolfpack standard--Mark Smith explains Wolfpack in
"Closing In on Clusters," page 51) is that Digital's clustering can
use off-the-shelf hardware for disks, network cards, and SCSI controllers.
Digital will officially support only its listed hardware (AlphaServer 1000,
1000A, 400, 2000, 2100, 2100A, 4100, Prioris ZX Pentium, Prioris ZX Pentium Pro,
Prioris HX, Prioris XL), but the software works on other systems, too. You can
use any two servers running NT Server 3.51 with Service Pack 4, but the
clustered CPUs must be of the same architecture. You can't mix Intel with Alpha because of
differences in how the NT File System (NTFS) handles file tags (information on
permissions, groups, etc.) and page logs on Intel and RISC platforms. The two
clustered systems don't have to be similarly configured (one can be a dual
Pentium and the other a quad Pentium Pro), but on each machine, you have to
install the same software (SQL Server, Oracle7 Workgroup Server, or any other
application) you intend to fail over from one system to the other.
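The pairing rules in this section (same CPU architecture, same failover software on both machines, CPU count free to differ) lend themselves to a quick sanity check. The Python sketch below is a hypothetical helper I made up for illustration; it is not part of Digital's administration tool, and the field names are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    arch: str   # for example "intel" or "alpha"
    cpus: int   # may differ between the two clustered machines
    software: frozenset = field(default_factory=frozenset)


def cluster_problems(a, b, failover_apps):
    """Return a list of reasons the two servers cannot be clustered."""
    problems = []
    # Intel and Alpha systems cannot be mixed in one cluster.
    if a.arch != b.arch:
        problems.append(f"CPU architecture mismatch: {a.arch} vs {b.arch}")
    # Every application that will fail over must be installed on both.
    for server in (a, b):
        missing = sorted(set(failover_apps) - set(server.software))
        if missing:
            problems.append(f"{server.name} is missing: {', '.join(missing)}")
    return problems


if __name__ == "__main__":
    dual = Server("dual_pentium", "intel", 2, frozenset({"SQL Server"}))
    quad = Server("quad_ppro", "intel", 4, frozenset({"SQL Server"}))
    print(cluster_problems(dual, quad, ["SQL Server"]))
```

An empty list means the pair passes these checks. Pairing the dual Pentium with an Alpha system that lacks SQL Server would report both an architecture mismatch and the missing software.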
The disk array is a BA356, which is part of the Prioris kit you buy from
Digital, without disk drives. This standard external storage chassis has a
multichannel-capable, Fast and Wide, differential SCSI-2 backplane--you can
have as many SCSI channels on it as you have drives and controllers in your two
servers. You can set up the disk array to be either in the middle of the SCSI
chain between the two servers or at the end. Where you put the array depends on
whether you leave the terminators installed and whether you use what Digital
calls a trilink SCSI adapter. This adapter is a Y connector from the
disk array to the two servers. You can order a standard cluster kit from Digital
that comes with cables, terminators, and an Adaptec 2944W Fast and Wide
differential SCSI-2 controller for each server.
The network connection is just a medium for a heartbeat between the two
machines. The heartbeat lets each machine know the other is alive. If one
disappears, the failover begins, and the remaining system takes over all
assigned functions.
This connection can be either a dedicated direct connection with a
basic 10Mbit Ethernet card or your usual high-speed LAN
connection. Beware of using your usual LAN, because your domain controller and
competition for your Ethernet media can introduce extra delays that can add to
the 20- to 30-second failover time. Also, a failure in the part of your network
between the clustered machines will initiate a failover: Each cluster machine
will think the other is dead, so the clustering software on each server will
drop ownership of the disks and leave them offline to prevent data corruption.
Digital recommends a direct, standalone connection between the two servers for
best performance.
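The heartbeat idea is easy to sketch in Python. The version below is a generic timeout-based detector, not Digital's actual protocol, which this article doesn't detail. The class name, the method names, and the 5-second timeout in the usage example are all assumptions. Timestamps are passed in explicitly so that the logic is easy to follow and test.

```python
class HeartbeatMonitor:
    """Tracks when the partner server was last heard from."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s  # silence longer than this means "dead"
        self.last_seen = None       # time of the most recent heartbeat

    def beat(self, now):
        """Record a heartbeat received from the partner at time `now`."""
        self.last_seen = now

    def peer_alive(self, now):
        """Return True if a heartbeat arrived within the timeout window."""
        if self.last_seen is None:
            # No heartbeat has been received yet.
            return False
        return (now - self.last_seen) <= self.timeout_s


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_s=5.0)
    monitor.beat(now=100.0)
    print(monitor.peer_alive(now=103.0))
    print(monitor.peer_alive(now=106.0))
```

A detector like this can't tell a dead partner from a dead link. Delayed or dropped heartbeats on a busy LAN look the same as a failed server, which is one reason a dedicated connection is the safer choice.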
The cluster software is where all the magic occurs. This software acts as a
shim--new code inserted without disrupting existing OS code. The
software provides the means for the SCSI drivers and the network layers in the
OS to carry out the clustering capabilities. The software also has an
administration tool for setting up drives, failover scripts, and other
characteristics of the cluster (such as its network alias and administrator
login and password). The logic behind the cluster's operation is complex, but
the user and administrator aspects are simple.