Optimizing for Fault Tolerance: RAID 1 and 5
Optimizing your server's disk storage is a balancing act: You want the best
possible performance, but you need to protect your data, too. RAID 1 and RAID 5
are two widely used methods for protecting data.
RAID 1, disk mirroring, is most often used for smaller critical data
volumes. It gives you complete fault tolerance (either drive in the mirror set
can fail without affecting system integrity or performance) and slightly better
performance than no RAID. The tradeoff? Because both drives are exact copies of
each other, you get only 50 percent of the disk capacity you purchased.
RAID 5 is the most commonly used option for fault-tolerant disk volumes in
NT because most manufacturers implement and support this method, it is part of
NT Server, and it offers a reasonable compromise between performance and disk
capacity. RAID 5 offers enhanced performance, protection, and far less capacity
loss than RAID 1. Because you can build a RAID 5 volume out of as few as three
drives, the maximum capacity you lose is 33 percent; the more drives you add,
the less total space you lose. RAID 5 offers better I/O read performance than no
RAID at all and, in some cases, even better than RAID 0 (because of the
striping algorithm used). The drawback of RAID 5 is that write performance
suffers significantly because every write operation requires a parity
calculation. This performance hit is high in software RAID 5; you'll probably
want to use a fast RAID controller to compensate for the overhead.
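
The capacity tradeoffs described above come down to simple arithmetic. The
following Python sketch works through it; the function names and the 9GB
drive size are illustrative assumptions, not part of any NT tool.

```python
def raid1_usable(drive_gb: float) -> float:
    """Two-drive mirror: both drives hold the same data, so one drive is usable."""
    return drive_gb


def raid5_usable(num_drives: int, drive_gb: float) -> float:
    """RAID 5 gives up one drive's worth of space to parity."""
    if num_drives < 3:
        raise ValueError("RAID 5 needs at least three drives")
    return (num_drives - 1) * drive_gb


def capacity_loss_percent(raw_gb: float, usable_gb: float) -> float:
    """Share of purchased capacity that is not available for data."""
    return 100.0 * (raw_gb - usable_gb) / raw_gb


DRIVE_GB = 9.0  # assumed drive size, for illustration only

mirror_loss = capacity_loss_percent(2 * DRIVE_GB, raid1_usable(DRIVE_GB))
print(f"RAID 1, 2 drives: {mirror_loss:.1f}% lost to the mirror")

for n in (3, 4, 5, 8):
    usable = raid5_usable(n, DRIVE_GB)
    loss = capacity_loss_percent(n * DRIVE_GB, usable)
    print(f"RAID 5, {n} drives: {usable:.0f}GB usable, {loss:.1f}% lost to parity")
```

With three drives the loss is one third of raw capacity, and it shrinks as
drives are added, which matches the 33 percent maximum cited above.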
The advantages of RAID 5 are that you can build very large fault-tolerant
disk volumes, and any one drive in the stripe set can fail without damaging data.
However, fault tolerance doesn't mean you won't suffer a little if a drive
fails. When one drive disappears from the stripe set, either your system CPU or
the RAID controller must compensate on the fly by using the remaining data and
parity information to reconstruct the data for every I/O request. Depending on
your system and controller, this reconstruction could mean as much as a 50
percent performance hit on that volume--but at least you're still running!
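
To make the on-the-fly reconstruction concrete, here is a minimal Python
sketch. It assumes the usual XOR parity scheme; the block contents and the
helper name xor_blocks are hypothetical, and the sketch does not model NT's
actual on-disk layout.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe spread across three data drives (hypothetical contents).
data = [b"NTFS", b"DATA", b"BLKS"]

# The parity block stored on a fourth drive is the XOR of the data blocks.
parity = xor_blocks(data)

# Simulate losing the second drive: only the survivors and parity remain.
survivors = [data[0], data[2], parity]

# XOR-ing everything that is left yields the missing block.
rebuilt = xor_blocks(survivors)
assert rebuilt == data[1]
print("Recovered block:", rebuilt)
```

Every read that touches the missing drive needs this extra pass over all the
surviving drives, which is where the performance hit comes from.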
In NT, this recovery process is automatic (as it is on hardware
controllers). NT also automatically rebuilds the volume when you replace the
faulty drive. As soon as the system gets a new drive, it begins the background
process of reconstructing the data on the new drive in the same way it handles
I/O requests on the fly (this process can take several hours, depending on the
volume/disk size). The process slows performance (more with software RAID than
on an accelerated controller), but as soon as reconstruction is finished, system
operations return to normal.
Also note that in software RAID 5, you often cannot break the set to add a
new drive. This limitation makes software RAID 5 on NT a less attractive
option, and some experts never recommend it. In contrast, this issue does not
arise with hardware RAID.
Other Fault-Tolerance Options
Two additional RAID fault-tolerance hardware options are RAID 3 and 4.
Although they're less common on NT systems than other options (and NT does not
support them), they offer fault tolerance through striping with parity data.
In addition to providing fault tolerance through RAID, some disk
controllers have special features that ensure availability in the event of a
disk crash. Some RAID arrays feature hot-swap drives: You can remove and insert
disks without powering off the disk cage or even the specific slot.
A hot-swap-capable array should never go down due to a drive failure
(barring component death of the backplane, faulty power supplies, or similar
problems). Systems without hot-swap drives require you to power down the system
to replace a bad drive. In systems with hot-swap bays, the controller/software
detects the new drive coming online and begins repairing the volume.
Another option is a hot-spare--a drive in the array that waits in standby
mode. If any other drive in the array fails, the system automatically switches
over to the hot-spare and begins rebuilding, without administrator intervention.
When you replace the faulty drive, it becomes the new hot-spare. You can enable
hot-spares through the controller's BIOS or management software.
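
The hot-spare behavior described above can be sketched as a small state
model in Python. The DiskArray class and the drive names are hypothetical;
real controllers implement this logic in firmware or management software.

```python
from dataclasses import dataclass, field


@dataclass
class DiskArray:
    """Toy model of an array with standby hot-spares (illustrative only)."""
    active: list                                     # drives holding data
    spares: list = field(default_factory=list)       # drives waiting in standby
    rebuilding: list = field(default_factory=list)   # drives being rebuilt

    def fail(self, drive):
        """Drop a failed drive and promote a hot-spare if one is waiting."""
        self.active.remove(drive)
        if not self.spares:
            return None  # no spare: the array runs degraded until serviced
        spare = self.spares.pop(0)
        self.active.append(spare)
        self.rebuilding.append(spare)  # rebuild starts without an administrator
        return spare

    def replace(self, new_drive):
        """The replacement for the faulty drive becomes the new hot-spare."""
        self.spares.append(new_drive)


array = DiskArray(active=["disk0", "disk1", "disk2"], spares=["spare0"])
promoted = array.fail("disk1")   # spare0 takes over for disk1
array.replace("disk3")           # the swapped-in drive waits in standby
print("active:", array.active)
print("spares:", array.spares)
```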
The Best of Both Worlds
A few combined RAID levels (e.g., RAID 10, 30, or 50) offer both performance
and fault tolerance by using two forms of RAID on the same logical volume at the
same time. As you might expect, you pay more to have both capabilities. The
extra cost arises because NT's Disk Administrator tool alone won't let you
combine RAID levels; to do so, you must pair a hardware RAID controller with
NT's software RAID functions.
One combined RAID level is RAID 10, also called mirrored stripe sets (i.e.,
a RAID 0 stripe set is mirrored to another stripe set). RAID 10 offers excellent
gains in read and write performance in sequential and random transaction
environments. In fact, it's the best overall performer of all RAID levels. The
cost, as with mirroring, is that you lose 50 percent of your planned disk
capacity. But, where simple mirroring (RAID 1) costs you only one drive per
mirrored set, RAID 10 costs you as many drives as are in the RAID 0 stripe set
(which can get expensive). Like RAID 1, RAID 10 makes a fault-tolerant volume
with the performance advantages of striping and no performance hit in the event
of a drive failure.
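
A short Python sketch of the drive-count arithmetic behind that comparison
follows. The function name mirror_cost and the 9GB drive size are
assumptions made for illustration.

```python
def mirror_cost(stripe_drives: int, drive_gb: float) -> dict:
    """Drive count and capacity when a set of `stripe_drives` is mirrored.

    stripe_drives=1 models plain RAID 1; larger values model RAID 10.
    """
    if stripe_drives < 1:
        raise ValueError("need at least one drive to mirror")
    total_drives = 2 * stripe_drives
    return {
        "total_drives": total_drives,
        "extra_drives_for_mirror": stripe_drives,  # one copy of every drive
        "usable_gb": stripe_drives * drive_gb,
        "raw_gb": total_drives * drive_gb,
    }


for width in (1, 2, 4):
    cost = mirror_cost(width, 9.0)  # 9GB drives assumed
    label = "RAID 1" if width == 1 else "RAID 10"
    print(f"{label}, stripe width {width}: "
          f"{cost['extra_drives_for_mirror']} extra drive(s), "
          f"{cost['usable_gb']:.0f}GB usable of {cost['raw_gb']:.0f}GB")
```

The usable fraction stays at 50 percent in every case; what grows with the
stripe width is the number of drives you buy purely for the mirror.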
Another combination of RAID 0 and 1 is RAID 01, or striped mirror sets,
which has similar characteristics to RAID 10. The main difference between RAID
10 and 01 is which RAID level the hardware controller handles and which the
software handles. In RAID 10, for example, if the software handles the striping,
the controller performs the mirroring; in RAID 01, the roles are reversed.
Not all RAID controllers support level 10 or 01. You'll need to check which
RAID levels a controller supports before you buy it. However, you can make
combined RAID by using hardware for the first part (RAID 0 striping or RAID 1
mirroring) and software for the second (the alternative mirror or stripe,
respectively). This solution does not perform as well as using a RAID hardware
controller that can handle both at the same time. But you can still build
high-performance, fault-tolerant disk volumes without replacing an existing RAID
controller.
Other RAID levels, such as 30 and 50, can also enhance performance and
fault tolerance, depending on your applications. With them, you can build very
large disk volumes out of commodity drives. However, these RAID levels are of
limited use in most low- to midrange NT server situations, unless your goal is
to experiment or achieve new and interesting disk configurations. RAID 50 is a
good option on an enterprise-scale server where you are trying to build a 500GB
or even 1000GB disk volume.
The Right RAID
With the variety of available RAID options, you can choose the right balance
of performance and fault tolerance for your site. Mixing hardware and software
RAID lets you build disk subsystems specifically tailored to your needs, such as
extremely large disk volumes or multiple-fault-tolerant arrays. Whichever
RAID level you choose, it's a disk technology you can't afford to be without.