January 14, 2010 12:00 AM

Exchange Server and Uptime: The Search for More 9s

Windows IT Pro
InstantDoc ID #103439

In some businesses, there's constant pressure to increase the uptime of messaging and other systems. I've worked with law firms, financial organizations, and other customers for whom time really is money, and their focus is often on squeezing as much uptime as possible out of their Microsoft Exchange Server organization. With that in mind, I want to start discussing how many 9s of uptime Exchange Server 2010 can offer.

Recall that four 9s is 99.99 percent uptime, meaning that the system is down for no more than 52 minutes and 36 seconds per year. That's a paltry 9 seconds or so per day! An uptime of 99.9 percent would allow just under 9 hours of downtime per year (about 8 hours and 46 minutes), which still isn't enough for most maintenance purposes. How is it that companies are seeking, and vendors are claiming, 99.9 percent or better uptime?
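
For readers who want to check the arithmetic, here is a minimal Python sketch of the downtime budget for a given number of 9s. The function names are illustrative, and the 365.25-day year is an assumption chosen because it reproduces the figures quoted above.

```python
# Downtime allowed by an availability target.
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365.25  # assumption: average year length, leap years included


def downtime_budget_seconds(availability_percent, period_days):
    """Return the seconds of downtime permitted over period_days."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * period_days * SECONDS_PER_DAY


def format_duration(seconds):
    """Format a number of seconds as hours, minutes, and whole seconds."""
    total = round(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes}m {secs}s"


for target in (99.9, 99.99, 99.999):
    per_year = downtime_budget_seconds(target, DAYS_PER_YEAR)
    per_day = downtime_budget_seconds(target, 1)
    print(f"{target}%: {format_duration(per_year)} per year, {per_day:.2f}s per day")
```

Each additional 9 cuts the budget by a factor of ten, which is why the per-day allowance shrinks so quickly.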

Let's start with a definition of what qualifies as uptime. The first time you have to install the monthly security patches—much less an Exchange rollup or a service pack—you'll blow right through your 9-seconds-per-day downtime limit on a single server. For that reason, Exchange lets you use multiple or clustered servers, and almost everyone excludes planned maintenance from uptime calculations.
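
To see how much the definition matters, the following Python sketch computes availability for the same month with planned maintenance excluded and included. The function name and the sample figures (a 30-day month, 2 hours of patching, 3 minutes of unplanned outage) are hypothetical and serve only as illustration.

```python
def measured_availability(period_hours, unplanned_down_hours,
                          planned_down_hours=0.0, exclude_planned=True):
    """Return percent availability over a period.

    With exclude_planned=True, planned maintenance is removed from the
    measured period and is not counted as downtime. Otherwise it counts
    as downtime like any other outage.
    """
    if exclude_planned:
        scheduled_hours = period_hours - planned_down_hours
        down_hours = unplanned_down_hours
    else:
        scheduled_hours = period_hours
        down_hours = unplanned_down_hours + planned_down_hours
    return 100 * (scheduled_hours - down_hours) / scheduled_hours


# Hypothetical month: 720 hours, 2 hours of patching, 3 minutes unplanned.
month_hours = 30 * 24
excluded = measured_availability(month_hours, 0.05, planned_down_hours=2)
included = measured_availability(month_hours, 0.05, planned_down_hours=2,
                                 exclude_planned=False)
print(f"Planned maintenance excluded: {excluded:.3f}%")
print(f"Planned maintenance included: {included:.3f}%")
```

With these made-up numbers, the same month clears four 9s under one definition and falls short of three 9s under the other.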

With that definition in mind, how many 9s is it reasonable to expect from Exchange? The real answer is a resounding "Who cares?" Not because uptime is unimportant, but because it's the wrong measurement. Rather than counting the seconds of downtime that you can tolerate, your efforts should be focused on two areas: recovery time objective (RTO) and recovery point objective (RPO).

RTO, of course, is the amount of time you're willing to allocate to recovery operations. This figure can range from seconds to days. For example, a complete restoration from a massive failure (like, say, a large office fire that melts all your servers) might take days, but failing over users from one Database Availability Group (DAG) member to another might take only seconds. You get to choose the RTO that's most appropriate for your business, then spend the right amount to ensure that you're protected.

RPO is a bit different, but equally important: It represents the amount of data loss you're willing to tolerate. For example, an RPO of four hours means that you're able to tolerate the loss of up to four hours of mail data. RPOs can range from seconds to weeks (imagine taking a full backup only once per month).
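
As a rough sketch, the two objectives can be checked against a recovery plan like this. The class, the function, and all the numbers are hypothetical. The sketch also makes a simplifying assumption: worst-case data loss equals the interval between protection points (backups or replicated copies), as if the failure struck just before the next one.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    name: str
    protection_interval_hours: float  # time between backups or copies
    estimated_recovery_hours: float   # time needed to restore service


def meets_objectives(plan, rpo_hours, rto_hours):
    """Return True if the plan satisfies both the RPO and the RTO."""
    # Assumption: the failure happens just before the next backup or
    # copy, so everything since the last protection point is lost.
    worst_case_loss_hours = plan.protection_interval_hours
    within_rpo = worst_case_loss_hours <= rpo_hours
    within_rto = plan.estimated_recovery_hours <= rto_hours
    return within_rpo and within_rto


# Hypothetical plans, checked against a 4-hour RPO and an 8-hour RTO.
monthly_backup = RecoveryPlan("monthly full backup", 30 * 24, 48)
replicated_copy = RecoveryPlan("continuously replicated copy", 0.01, 0.05)

for plan in (monthly_backup, replicated_copy):
    verdict = "meets" if meets_objectives(plan, rpo_hours=4, rto_hours=8) else "misses"
    print(f"{plan.name}: {verdict} the objectives")
```

The point of the comparison is that the objectives, not a count of 9s, decide which plan is acceptable.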

Together, these two factors make up a significant chunk of your service level agreement (SLA). You might not have a formal, written SLA, but I would bet a box of Krispy Kreme doughnuts that you have an implicit SLA that your messaging operations are expected to meet—even if you don't find out about it until an emergency happens. Fallout over implicit SLAs often takes the form of loud arguments about uptime after a failure, threats of firing, and so on, although the results can be more subtle.

Notice that I didn't spend any time in the preceding paragraphs telling you how many 9s Exchange 2010 can deliver. That's because the answer is a big fat "It depends." In future UPDATEs, I'll be delving into this topic in more detail. In the meantime, though, I'd love to hear what your RTO and RPO are, and what your SLA (if any) says they should be.

Comments
  • Phil
    Jan 15, 2010

    How uptime is measured will vary, and the more critical the need, the more finely an IT department will measure it. The “nines” are a way to express total unplanned downtime over a period of time (month, quarter, year). Patches and upgrades are typically not considered unplanned downtime, hence they would not “blow right through” the unplanned downtime numbers. There are a few things that need to be considered here. First, one cannot choose when unplanned downtime occurs. An application or service may only be needed for certain hours of the day, like 9:00 to 5:00, but downtime during those hours could be disastrous. Second, uptime in a true fault-tolerant five-9s environment also means no failure or failover, no downtime of minutes or more, and no loss of “in-flight” data, which is something that failover solutions cannot provide.

Windows is a trademark of the Microsoft group of companies. Windows IT Pro is used by Penton Media Inc. under license from owner.