June 13, 2001 12:00 AM

Endurance 6200 3.0

Windows IT Pro
InstantDoc ID #21140
Fault tolerance beyond clustering

Fault tolerance means different things to different people. According to a broad definition, fault tolerance ensures that an application is always available to its users. For example, if a problem occurs with an application on one server in a clustered server scenario, another server takes over. But although clusters provide high availability for applications, they don't satisfy my definition of true fault tolerance because the application's recovery from a system failure isn't always transparent to users.

Clustering actually has several drawbacks. To be fully functional in a clustered setting, applications must be cluster-aware. Because developers must include code that supports clustering and failover, a typical off-the-shelf Win32 application doesn't fully support clustered operation. In a clustering scenario, the failover—and failback—process isn't perfect: When an application fails on a server, existing user sessions disappear, forcing users to reconnect to the application after it moves to a new server. If the application relies on the server to maintain session state information, the client's session state information is lost.

Marathon Technologies has developed a new approach that doesn't suffer from the shortcomings of clustered servers. In this Lab feature, I take a close look at the company's Endurance 6200 3.0 fault-tolerant server array.

A Unique Architecture
The Endurance architecture separates application processing and application I/O on two different computers and provides a redundant backup for each. The resulting set of systems—or array, in Marathon terminology—can continue application processing uninterrupted after any individual component fails.

As Figure 1, page 78, shows, an Endurance system consists of two pairs of interconnected Windows NT 4.0 servers that appear to the user (and the application) as one system. Each pair—or tuple, in Marathon terminology—includes one dual-processor Compute Element (CE) and one I/O Processor (IOP). All I/O devices (e.g., hard disks, network adapters, CD-ROM drive, 3.5" disk drive) physically connect to the IOPs. The CEs have only one standard I/O device: a 3.5" disk drive intended for firmware updates. The two IOPs also must have identical configurations, with the same NICs and RAID controllers in the same slots and identical disk configurations. Each computer uses a Marathon Interface Card (MIC)—a proprietary 50MBps, full-duplex, low-latency, dual-port card that interconnects each IOP with both CEs. In each IOP, you must also install another standard NIC (preferably Gigabit Ethernet) that Endurance can use to mirror data from one IOP to the other during installation and during the recovery phase following the failure of an IOP. You can separate the two tuples with as much as 500 meters of multimode fiber-optic cable.

Both tuples run your application at the same time, all the time. This simultaneous execution is the key design difference between an Endurance array and server clusters. If any element of the array fails, the application continues to run uninterrupted on the rest of the server array. Within a tuple, the CE performs all application processing and the IOP handles all I/O. This design insulates the CEs from any failures that disk subsystem components might induce. However, poorly written application code can still cause an application to fail. And because applications run on both CEs simultaneously, an application that fails for this reason will fail on both CEs.
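The distinction between a hardware fault and a software fault can be shown with a small sketch. This is an illustration only, not Marathon's implementation: the function `run_on_array` and the two sample applications are hypothetical, and a simple loop stands in for the two CEs running in parallel.

```python
def run_on_array(app, request, ce_healthy):
    """Run the same app on every healthy CE; answer if any CE produced a result."""
    results = []
    for healthy in ce_healthy:
        if not healthy:
            continue                      # hardware fault: this CE contributes nothing
        results.append(app(request))      # same code and same input on each CE
    if not results:
        raise RuntimeError("no CE available")
    return results[0]


def good_app(x):
    return x * 2


def buggy_app(x):
    return 1 // 0                         # deterministic software bug


# One CE has failed, yet the request is still answered.
answer = run_on_array(good_app, 21, ce_healthy=[False, True])

# Both CEs are healthy, but the bug fires on every copy of the application.
try:
    run_on_array(buggy_app, 21, ce_healthy=[True, True])
    bug_survived = True
except ZeroDivisionError:
    bug_survived = False
```

Redundant hardware masks the loss of a CE, but because both CEs execute identical code, a defect in the application reproduces on both.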

Although both IOPs have a NIC connected to the same network segment, only one NIC is active at a time. If the active network connection fails, the standby takes over, using the same IP address and the Endurance-assigned soft media access control (MAC) address. Endurance lets you configure each IOP with as many as four NICs.
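The active/standby behavior can be modeled in a few lines. The sketch below is an assumption-laden illustration, not Endurance code: the `Nic` and `NetworkPair` classes are hypothetical, and the IP and MAC values are placeholders from documentation and locally administered ranges.

```python
from dataclasses import dataclass


@dataclass
class Nic:
    iop: str               # which IOP hosts this adapter
    link_up: bool = True


class NetworkPair:
    """Hypothetical model: one shared IP/MAC identity, one active NIC at a time."""

    def __init__(self, ip, soft_mac, nics):
        self.ip = ip
        self.soft_mac = soft_mac
        self.nics = nics
        self.active = nics[0]

    def check(self):
        """Promote the first healthy NIC if the active link is down."""
        if not self.active.link_up:
            for nic in self.nics:
                if nic.link_up:
                    self.active = nic
                    break
        return self.active


pair = NetworkPair("192.0.2.10", "02:00:00:00:00:01",
                   [Nic("IOP1"), Nic("IOP2")])
pair.nics[0].link_up = False    # simulate failure of the active connection
active = pair.check()           # the standby on IOP2 takes over
```

Because the address identity belongs to the pair rather than to either adapter, clients keep using the same IP and MAC address after the standby takes over.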

Endurance's underlying architecture is fairly complex. The system redirects network I/O from the active network connection to both CEs. Application I/O requests originate on one CE, and both CEs can redirect the I/O requests to both IOPs. I/O is synchronous: the system doesn't signal I/O write completion back to the application on the CEs until both IOPs have completed the write. Marathon reports that because of the overhead involved, an application running on an Endurance array achieves 85 to 90 percent of the throughput it achieves on a single server.
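The synchronous write rule can be illustrated with a minimal sketch. It is not Marathon's code: the `Iop` class and `mirrored_write` function are hypothetical, and Python dictionaries stand in for the mirrored disks. It shows only that completion is reported after every IOP has acknowledged the write.

```python
from concurrent.futures import ThreadPoolExecutor


class Iop:
    """Hypothetical stand-in for one I/O Processor; a dict plays the disk."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_no, data):
        self.blocks[block_no] = data
        return True                      # acknowledgement


def mirrored_write(iops, block_no, data):
    """Issue the write to every IOP and return only after all have acknowledged."""
    with ThreadPoolExecutor(max_workers=len(iops)) as pool:
        # list() blocks until every IOP's write call has returned.
        acks = list(pool.map(lambda iop: iop.write(block_no, data), iops))
    return all(acks)


iop1, iop2 = Iop("IOP1"), Iop("IOP2")
done = mirrored_write([iop1, iop2], 7, b"payload")
```

Waiting for the slower of the two writes is one plausible source of the overhead Marathon describes, although the article does not break that overhead down.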

To let both CE processors execute the same instruction at the same time, the CEs must be identically configured, down to processor, BIOS, and firmware revision levels. The requirement that both CEs have exactly two processors—no more and no less—is a limitation of the current Endurance implementation. Marathon expects to support four-processor systems by the end of this year. Such support should make Endurance a more attractive platform for fault-tolerant database servers.
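A pre-installation check for the identical-configuration requirement might look like the following sketch. The function name, the dictionary keys, and the revision values are all hypothetical placeholders; the article doesn't describe how Endurance itself verifies the match.

```python
def config_mismatches(ce_a, ce_b):
    """Return the sorted keys whose values differ between two CE descriptions."""
    keys = set(ce_a) | set(ce_b)
    return sorted(k for k in keys if ce_a.get(k) != ce_b.get(k))


# Placeholder inventories; real values would come from the servers themselves.
ce1 = {"cpu_count": 2, "cpu_stepping": "A1", "bios": "1.04", "firmware": "3.0"}
ce2 = {"cpu_count": 2, "cpu_stepping": "A1", "bios": "1.05", "firmware": "3.0"}

mismatches = config_mismatches(ce1, ce2)   # only the BIOS revision differs
```

An empty result would mean the two descriptions match on every recorded attribute.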

Endurance requires Intel-based servers. Marathon can provide you with a list of tested hardware configurations. In general, the system supports equipment from the major server manufacturers, and Marathon's Web site advertises systems based on Hewlett-Packard (HP), IBM, Dell, and Compaq hardware. The Endurance array that I tested required NT Server 4.0 with Service Pack 6 (SP6) or SP6a. A Windows 2000 version will be available by press time, and Marathon plans to support both OSs for the foreseeable future.
