Windows IT Pro is the authoritative and independent resource for Windows NT, Windows 2000, Windows 2003, and Windows XP. It features a collection of resources and magazines for Windows IT professionals.
  
  


June 2000

Best Practices for High Availability



Master the uptime equation

A sizable portion of the analyst community has long been skeptical of Microsoft Exchange Server's ability to deliver a highly available messaging solution. In July 1999, GartnerGroup cautioned companies against consolidating smaller Exchange servers (e.g., systems that support fewer than 500 users) into large systems because of the difficulty of managing large databases and the poor performance that clients often experience when connecting Messaging API (MAPI) clients over extended WAN links. Exchange 2000 Server addresses many of the concerns that analysts have expressed. Better clustering, the partitioning of the Information Store (IS) into easily manageable databases and storage groups (SGs), and better integration with the OS all contribute to a more resilient service.

Advances in Windows 2000 (Win2K) and Exchange 2000 are important but are only part of the overall equation that determines how to deliver highly available systems. In this article, I want to reflect on how to achieve highly reliable systems with Exchange 2000 and Exchange Server 5.5.

An Uptime Survey
Before we look at an approach to uptime, let's look at what people are achieving today. In late 1999, the California research company Creative Networks conducted a survey of 63 companies. The companies' mean level of Exchange Server uptime was 99.6 percent. A full 56 percent of the companies were exceeding their uptime target. The survey left this target unstated, but it was probably higher than 99 percent. Clearly, this survey covered only a tiny portion of the Exchange Server installed base, but the 99.6 percent mean uptime level was greater than I expected. The survey also reported that the companies experienced an average of 71 minutes of unscheduled downtime per month, as well as an additional 112 minutes of scheduled downtime.
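
The survey figures can be sanity-checked with a little arithmetic. The following is a minimal sketch, assuming a 30-day month and assuming that availability is simply the fraction of the period not lost to downtime; the survey itself doesn't say how it calculated the 99.6 percent mean or whether scheduled downtime counted against it. The function name is illustrative only.

```python
# Assumption: a 30-day month. The survey does not state the period it used.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes


def availability_percent(downtime_minutes, period_minutes=MINUTES_PER_MONTH):
    """Return availability as a percentage of the period not lost to downtime."""
    return 100.0 * (1 - downtime_minutes / period_minutes)


unscheduled = 71   # minutes per month, from the survey
scheduled = 112    # minutes per month, from the survey

# Counting unscheduled downtime only.
print(round(availability_percent(unscheduled), 2))              # 99.84
# Counting unscheduled plus scheduled downtime.
print(round(availability_percent(unscheduled + scheduled), 2))  # 99.58
```

Under these assumptions, the combined 183 minutes of monthly downtime works out to roughly the 99.6 percent mean that the survey reported.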

I want to know what problems caused the unscheduled downtime and why the administrators needed to take Exchange Server down on a scheduled basis each month. Installing service packs and hotfixes is a good reason, but I wonder whether some administrators are unnecessarily running Eseutil to compact databases. Unscheduled downtime of 71 minutes is a lot, especially if it occurs at peak times during the day, such as 9:00 a.m., when users are attempting to read their email after they've arrived at work.

The Uptime Equation
Table 1 illustrates acceptable downtime at different levels of availability. Typically, highly available systems seek to attain uptime of 99.99 percent or greater. Few Windows NT systems—let alone those that run Exchange Server 5.5—ever attempt to meet such a lofty goal. And software isn't the only underlying reason.
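
To see how quickly the downtime budget shrinks as the availability target rises, you can work out the figures yourself. The sketch below is not a reproduction of Table 1; it assumes a 365-day year of round-the-clock service, and the function name and the list of availability levels are chosen for illustration.

```python
# Assumption: a 365-day year of 24 x 7 service.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_pct, period_minutes=MINUTES_PER_YEAR):
    """Return the downtime budget, in minutes, for a given availability target."""
    return period_minutes * (1 - availability_pct / 100.0)


# Illustrative availability levels; 99.6 is the survey mean cited earlier.
for level in (99.0, 99.6, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(level)
    print(f"{level:>7}% -> {minutes:8.1f} min/year ({minutes / 60:6.2f} hours)")
```

At 99.99 percent, the yearly budget is a little less than an hour (about 52.6 minutes), which is less than the unscheduled downtime that the surveyed companies averaged in a single month.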

The classic equation that expresses the factors that determine uptime is

Uptime = Software + Hardware + Operations + Environment

This equation shows that a combination of software, hardware, operations, and operating environment determines uptime. The failure of any one of these elements affects uptime. You can blame Microsoft for bugs in NT or Exchange Server, but you can't blame the company if a hardware fault causes Exchange Server to stop, if an operator fails to perform a backup the day before a disk corruption occurs, or if a network failure stops messages from transporting across a backbone or to the Internet.
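
The equation is qualitative rather than arithmetic. One common way to put numbers on it, which is an assumption here and not something the equation itself states, is to treat the four factors as independent components in series, so that overall availability is the product of the individual availabilities. The per-factor figures below are hypothetical.

```python
from math import prod


def combined_availability(component_availabilities):
    """Series model (an assumption): the service is up only when every factor is up."""
    return prod(component_availabilities)


# Hypothetical figures: each factor is individually available 99.9 percent of the time.
factors = {
    "software": 0.999,
    "hardware": 0.999,
    "operations": 0.999,
    "environment": 0.999,
}

overall = combined_availability(factors.values())
print(f"{overall:.4%}")  # 99.6006%
```

Under this model, four factors that each achieve 99.9 percent deliver only about 99.6 percent together, which is why excellence in one area can't compensate for weakness in another.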

Companies that achieve high availability take a rigorous approach to mastering the uptime equation. As an example of how successful some companies are at achieving high availability, one company's OpenVMS servers were up continuously from 1981 to 1999, a span of more than 18 years. Systems administrators took the servers down only to perform Y2K-compatibility checks on an application that controlled a manufacturing process. Clearly, those administrators operated on the "If it ain't broke, don't fix it" principle: They never applied upgrades or patches to the OS or the application, and they resisted the temptation to upgrade hardware components. The decision not to upgrade anything is a decision that you can probably make only on systems that operate within restricted networks (or no network at all) and in special circumstances. Given the number of service packs and hotfixes that Microsoft has issued for NT and Exchange Server over the past 5 years, the notion of keeping a server online all the time is difficult to fathom. You'd still be running Exchange Server 4.0 (with no patches) on NT 3.51 Service Pack 4 (SP4), you wouldn't be secure, and you wouldn't be Y2K-compliant. Of course, a comparison of OpenVMS and NT is unfair, largely because of the increased pace of development in both hardware and software today. In 1981, systems administrators typically saw one software upgrade and one new VAX computer per year. Now, new hardware debuts monthly, and service packs, hotfixes, and completely new OS releases (e.g., Win2K) occur frequently.

To build highly available servers, companies often plan their implementations based on the following simple principles:

  • Never deploy software without consideration. Carry out design and planning exercises to ensure that you can deploy software in a manner that delivers quality service.
  • Carefully test software before you put it into production. Test all aspects of the combination of OS, Exchange Server, service packs, and third-party software that will deliver the messaging service to users.
  • To protect your databases, always use high-quality hardware for Exchange servers, and pay careful attention to the disk subsystem (i.e., controller and disks). Monitor firmware updates to controllers and disks, and apply the updates regularly during scheduled maintenance. Be sure to protect the hardware from power surges or other electrical faults.
  • Take a highly disciplined approach to systems management and monitoring. Perform and verify backups. Scan event logs daily, and proactively identify any problems that might lurk in the background. Use regularly updated antivirus software. Run disaster-recovery exercises, and document the results. Record statistics monthly.
  • Pay close attention to any environmental variable that might affect the servers' smooth operation. Monitor the network, and avoid making changes to underlying parts of the infrastructure (e.g., DNS servers, WINS servers) that might affect client or server connectivity. Train users to make effective use of system and network resources.

Microsoft Classifications
In white papers such as "Microsoft Windows NT High Availability Operations Guide: Implementing Systems for Reliability and Availability" (http://www.microsoft.com/ntserver/nts/deployment/planguide/highavail.asp), Microsoft has attempted to bring these guidelines to the forefront. This paper defines six classifications for operational procedures: Planning and Design, Operations, Monitoring and Analysis, Help Desk, Recovery, and Root Cause Analysis. You could apply these classifications, which aren't unique to NT, to any enterprise-level OS—they lay down the foundation of a plan to achieve high availability.

Planning and Design. Clearly, you need to start with a good design, and you achieve a good design only through detailed planning. As Win2K and Exchange 2000 debut, you might want to assign a special team (e.g., employees responsible for the network, namespace, OS, and applications) to perform the detailed design work. All too often, a company's network team generates one design while the OS team generates another, and both designs are supposed to work seamlessly together to form a basis for application deployment. The unfortunate applications team typically gets the dirty end of the stick: This team must work within the constraints of design work that the other teams have performed without regard to the applications team's requirements. Because Active Directory (AD) integrates data from the OS and basic functions such as DNS with data from AD-integrated applications such as Exchange 2000, AD's design and implementation require that different teams work together to create a unified design. Companies that generate plans based on all needs will achieve much higher reliability than companies that expect fractured planning to automatically gel.

Operations. Drawing up a set of operational procedures is a good start. However, you must execute the procedures to achieve success.

Monitoring and Analysis. If you master the basics—such as performing full daily online backups, running Performance Monitor to keep an eye on the system, and checking the event logs for unexplained errors—you'll experience less downtime and fewer system outages than if you simply trust that Microsoft software never goes wrong.

Help Desk. You need to establish well-defined escalation paths so that people know what steps to take in problem resolution when the first potential solution fails. Systems administrators perform first-level resolution, but what happens when you find a backup tape that contains bad data that you can't restore? Who takes care of directory or public folder replication that doesn't work? Who sorts out a DNS problem that prevents servers from finding one another, thereby halting message routing?

Recovery. A tested disaster-recovery plan is an important component of your escalation procedure. As a speaker at the 1999 Microsoft Exchange Conference said, "An untested disaster-recovery plan is simply a set of well-organized and documented prayers."

Root Cause Analysis. Mainframe and minicomputer administrators are accustomed to analyzing why problems occur so that they can take measures to prevent the problems from recurring. This discipline comes from the earliest days of computers, when CPU time and memory were precious resources that you didn't want to waste with buggy programs or with insufficient or inaccurate operating procedures. Are we taking the same care with our Windows systems? In May 1997, the late and lamented Byte Magazine concluded that NT administrators are extremely prone to taking shortcuts to get systems back online after an outage and don't take the time to understand why the problem occurred. Perhaps this behavior is a throwback to Windows' history, when a quick reboot was often the only way to stop a looping program or to regain memory or other system resources that were gently leaking away. The temptation is to reach for the power switch. However, cycling the power not only hides the root cause but can also generate new problems. Win2K is more complex than NT, and the relationship between the OS and applications is deeper than ever. Going for the quick fix doesn't make sense, particularly if you're still learning how the software works. Instead, systems administrators need to understand why problems occur and how best to address them.

Set Your Priorities
You might wonder why a messaging system needs to attain an uptime record greater than 99.9 percent. The obvious answer is that email is like your telephone—users expect it to be available all the time, just like a dial tone. Even so, efforts to attain greater than 99.9 percent uptime might be a matter of expending much effort for little gain. Email isn't as time-critical as other applications. A material-wasting computer failure in a manufacturing plant's production process is an obvious example of a situation in which uptime is cost-critical, but a messaging system hardly falls into that category. Losing a messaging service irritates users and slows the delivery of some messages, but it's hardly the end of the world. Phone or fax messages can always cover the outage. Alternatively, you can maintain a free MSN Hotmail account for those occasions when your corporate email system is unavailable.

A system is a set of pieces that fit and work together. If you expect a system such as Exchange Server to work perfectly all the time, particularly if you concentrate on only one piece of the puzzle—such as the hardware design for servers—you're headed for a disappointing experience in which you'll probably lose some data. Mastering the uptime equation is essential to a successful messaging-system implementation.
