From the UNIX/mainframe perspective, large disk
systems with gigabytes of capacity are nothing new, and fault tolerance and RAID
configurations are taken for granted. From the PC point of view, however, these
systems can be daunting and confusing. Many administrators just take the default
settings on their server and leave it at that. That approach is simple, but it
does nothing for server performance, because the disk subsystem is never tuned
for the applications you run or the configuration of your hardware.
Our Exchange/NT scalability tests gave us an
ideal place to begin analyzing the effects of disk configuration and performance
on overall system performance. On a system such as the Tricord PowerFrame, which
gives you a wide variety of setup options, you have lots of room to find the
ideal organization of disks, controllers, and files.
Tricord studied RAID performance on NTFS
volumes, and we used that study as the basis for setting up our test
environment. Using its own hardware-accelerated RAID controller and nine 2.1GB
Seagate drives, Tricord found that stripe size has a big impact on performance,
that the type of disk transaction (sequential or random, read or write) affects
throughput, and that the RAID level (a trade-off between the fault tolerance you
want and how much it costs) plays an important role in overall system
performance. This may all seem patently obvious (the word "duh" comes to mind),
but if you aren't thinking about performance when you set up your system, you
aren't taking advantage of what's available.
The Tricord test (as seen in Graph 1) varied the stripe size
(disk sectors per stripe, a.k.a. logical blocking factor) from eight sectors per
stripe to 4096 sectors per stripe on three types of RAID volumes: 0, 5, and 10.
Note that stripe size is not the same as cluster size on a single drive. A
cluster is the minimum number of bytes that form a logical unit on a disk, so a
file cannot take up less space than a single cluster. RAID stripe width
determines how much data, in 512-byte sectors, is written to a disk in a single
chunk before the controller submits the I/O to the next drive in the stripe set.
The test ran on a PowerFrame ES133 (two 133MHz Pentium CPUs, 256MB RAM, and one
2-channel Intelligent Storage Subsystem, or ISS, controller, running NT 3.51
with no service packs). Tricord configured the nine drives as follows: Drive 1
on bus 1 had a 300MB FAT partition with the OS on it, and the rest of the drive
was a single FAT partition with the pagefile on it; drives 2 through 9 resided
on bus 2 and were striped (RAID 0, 5, or 10, at the varying stripe widths) for
an NTFS volume with 8GB of usable storage.
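To make the stripe-size idea concrete, here is a minimal sketch of how a striped (RAID 0) set could map logical sectors to drives. It is not Tricord's controller logic; the function names and the simple round-robin layout are assumptions for illustration. Only the 512-byte sector size, the 8 to 4096 sector range, and the 64KB test block come from the article.

```python
SECTOR_BYTES = 512  # sector size stated in the article


def stripe_bytes(sectors_per_stripe):
    """Size in bytes of the chunk written to one drive before moving on."""
    return sectors_per_stripe * SECTOR_BYTES


def locate(logical_sector, sectors_per_stripe, num_drives):
    """Map a logical sector to (drive index, sector on that drive).

    Assumes a plain round-robin RAID 0 layout (hypothetical, for illustration).
    """
    chunk = logical_sector // sectors_per_stripe   # which chunk of the volume
    within = logical_sector % sectors_per_stripe   # position inside that chunk
    drive = chunk % num_drives                     # chunks rotate across drives
    row = chunk // num_drives                      # full stripes already laid down
    return drive, row * sectors_per_stripe + within


def drives_touched(start_sector, length_sectors, sectors_per_stripe, num_drives):
    """Sorted list of drives that a single I/O request lands on."""
    first = start_sector // sectors_per_stripe
    last = (start_sector + length_sectors - 1) // sectors_per_stripe
    return sorted({chunk % num_drives for chunk in range(first, last + 1)})


if __name__ == "__main__":
    block = 64 * 1024 // SECTOR_BYTES  # one 64KB request = 128 sectors
    for size in (8, 128, 1024, 4096):
        print(size, stripe_bytes(size), drives_touched(0, block, size, 4))
```

Under this layout, a 64KB request is spread over every drive of a four-drive set when the stripe is eight sectors, but stays on a single drive once the stripe reaches 128 sectors or more.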
The RAID 0 volume (striping only) used only
four drives, with two drives on each available SCSI channel to get optimal
controller performance. Splitting read/write activity across multiple drives
greatly enhances performance over a standard sequential volume set, and further
splitting this activity across multiple channels on the same controller improves
it even more, since the controller can process the I/O in parallel. The RAID 10
volume (mirrored stripe sets) used eight drives (as seen in Figure A), split so
that half of each mirror set and half of each stripe set sat on each of the two
SCSI channels (using both buses for both striping and mirroring only slightly
improved performance over using one channel for each stripe set). The RAID 5
volume (striping with parity) also used both buses, with three drives on bus 1
and two on bus 2. The 8MB disk cache was enabled for each of the logical devices
being tested, although it proved to offer only a minimal performance improvement
under this test.
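The drive counts above (four for RAID 0, five for RAID 5, eight for RAID 10) all produce roughly the same usable space from 2.1GB drives. The following sketch shows that arithmetic, using the standard capacity rules for each level; the function name `usable_gb` is hypothetical.

```python
def usable_gb(level, drives, drive_gb):
    """Usable capacity of a RAID set built from identical drives."""
    if level == 0:
        # Striping only: every drive holds data.
        return drives * drive_gb
    if level == 5:
        # Striping with parity: one drive's worth of space goes to parity.
        if drives < 3:
            raise ValueError("RAID 5 needs at least three drives")
        return (drives - 1) * drive_gb
    if level == 10:
        # Mirrored stripe sets: half the drives hold copies.
        if drives < 4 or drives % 2:
            raise ValueError("RAID 10 needs an even number of drives, four or more")
        return (drives // 2) * drive_gb
    raise ValueError("unsupported RAID level")


if __name__ == "__main__":
    # Drive counts from the test setup, 2.1GB per drive.
    for level, drives in ((0, 4), (5, 5), (10, 8)):
        print(f"RAID {level}: {drives} drives, {usable_gb(level, drives, 2.1):.1f}GB usable")
```

Each configuration works out to about 8.4GB, in line with the 8GB of usable storage quoted for the NTFS volume.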
The tool Tricord used (which we will also be
using to test RAID solutions in the near future) was designed by National
Peripherals for performance testing RAID on NT. It generates all of the load
(sequential reads and writes, random reads and writes) at the server instead of
over a network. The tool used a 100MB test file, which the testing application
accessed in 64KB blocks, and each test run lasted 180 seconds. The test measures
data transfer rates in MB per second.
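The parameters of that test translate into a simple access pattern and a simple throughput formula. The sketch below is not the National Peripherals tool; it only illustrates how sequential and random offsets over a 100MB file in 64KB blocks might be generated and how a MB-per-second figure follows from a 180-second run. All function names are hypothetical, and it assumes 1MB means 1024 x 1024 bytes.

```python
import random

FILE_BYTES = 100 * 1024 * 1024   # 100MB test file
BLOCK_BYTES = 64 * 1024          # 64KB per I/O request
RUN_SECONDS = 180                # length of each test run


def sequential_offsets(file_bytes=FILE_BYTES, block_bytes=BLOCK_BYTES):
    """Byte offsets for one front-to-back pass over the file."""
    return list(range(0, file_bytes - block_bytes + 1, block_bytes))


def random_offsets(count, file_bytes=FILE_BYTES, block_bytes=BLOCK_BYTES, seed=0):
    """Block-aligned byte offsets chosen at random (seeded for repeatability)."""
    rng = random.Random(seed)
    blocks = file_bytes // block_bytes
    return [rng.randrange(blocks) * block_bytes for _ in range(count)]


def throughput_mb_per_s(blocks_completed, block_bytes=BLOCK_BYTES, seconds=RUN_SECONDS):
    """Data transfer rate in MB per second for a finished run."""
    return blocks_completed * block_bytes / (1024 * 1024) / seconds


if __name__ == "__main__":
    print(len(sequential_offsets()), "blocks per pass over the file")
    print(random_offsets(3))
    print(throughput_mb_per_s(28800), "MB/s for 28,800 blocks in 180 seconds")
```

One pass over the file is 1,600 requests, so a 180-second run cycles through the file many times on any volume that sustains more than a fraction of a megabyte per second.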
As you can see in Graph 1, a stripe size of 1024 sectors
per stripe (512KB) is the point where performance starts to dip or level off for
all RAID and I/O activity types. On the RAID 0 volume, performance
characteristics are very similar for reads and writes, and random activity
performs slightly better on this type of volume than sequential activity. On the
RAID 5 volume, write operations are far slower than reads, and random I/O is
once again faster than sequential I/O. On the RAID 10 volume, there was a
significant gap between read and write performance: read throughput was much
higher than write throughput on both random and sequential I/O.
Analysis
Choosing a RAID level depends on what you are
doing and how much money you have to spend. RAID 0 offers high performance at
low cost: you can stripe many drives for the best I/O throughput, but there is
no redundancy or fault tolerance. (A hardware RAID controller is much faster
than NT's software striping.) RAID 5 offers fault tolerance at a cost only
slightly higher than RAID 0, since it requires only one drive's worth of space
for parity data. However, RAID 5 has the slowest I/O. A hardware-accelerated
RAID controller can alleviate some performance problems, and will be much faster
than NT's software RAID 5, but it will still be far slower than RAID 0. RAID 10
offers the best combination of performance and fault tolerance (especially if
your system supports hot-swap drives) on the system we used for the NT
scalability tests (future tests will verify this on other systems as well). The
problem is that RAID 10 is very hardware intensive, requiring multichannel
hardware-accelerated controllers and twice as many drives as RAID 0. Since you
are building mirrored stripe sets, you don't need a parity drive, and you don't
need to duplex controllers (you can, but you'll take a performance hit, since
duplexing is done in software in NT).
RAID 10 is an excellent option for mission-critical
enterprise applications where fault tolerance and high performance are both
absolutely necessary and money is no object. Most IS shops have financial
constraints, so they should consider smaller RAID 10 volumes for those portions
of the system that need the performance and fault tolerance (such as data
drives), and use normal striping (or RAID 5) for everything else.
This consideration brings up the point of the
I/O transaction mix. The results graph shows the disparities among the types of
I/O on the different RAID volumes, so you should analyze your workload before
choosing a RAID level. If you are in a write-intensive environment, do not use
RAID 5. Such an environment performs better on a RAID 0 volume, so use RAID 0
with frequent backups, or RAID 10. A mixed environment runs very well on a RAID
10 volume. A read-intensive environment will benefit most from RAID 10, followed
by RAID 5, and then RAID 0. Again, it depends on how much money you have, what
balance of performance vs. fault tolerance you need, and what your workload is.
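As a rough illustration, the guidance in this paragraph can be written down as a lookup keyed on the read/write mix. This is only a sketch: the function name and the 30% and 70% cutoffs for "write-intensive" and "read-intensive" are assumptions, not figures from the tests, and it ignores budget entirely.

```python
def recommend(read_fraction, write_heavy_below=0.3, read_heavy_above=0.7):
    """Return RAID levels to consider, best first, for a given read share.

    read_fraction is the share of I/O operations that are reads (0.0 to 1.0).
    The two cutoffs are hypothetical; tune them to your own workload analysis.
    """
    if not 0.0 <= read_fraction <= 1.0:
        raise ValueError("read_fraction must be between 0.0 and 1.0")
    if read_fraction >= read_heavy_above:
        # Read-intensive: RAID 10, followed by RAID 5, then RAID 0.
        return ["RAID 10", "RAID 5", "RAID 0"]
    if read_fraction <= write_heavy_below:
        # Write-intensive: avoid RAID 5; RAID 0 needs frequent backups.
        return ["RAID 0", "RAID 10"]
    # Mixed workload: RAID 10 runs very well.
    return ["RAID 10"]


if __name__ == "__main__":
    for reads in (0.03, 0.40, 0.90):
        print(f"{reads:.0%} reads -> {recommend(reads)}")
```

A real decision would also weigh cost and the level of fault tolerance required, which the lookup leaves out.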
We chose RAID 10 for all volumes under test
with Exchange/LoadSim on the PowerFrame because we anticipated a very mixed I/O
environment (write intensive and sequential for the log volumes, and random
reads and writes for the data volumes). We also gained a high level of fault
tolerance in case of problems during a test run. The I/O turned out to be
predominantly write-oriented (anywhere from 97% writes and 3% reads when the
system had 1024MB of RAM, to a 60/40 write/read split at 128MB), so RAID 10 was
definitely the best choice. Plus, price was no object, since we had 60 drives
lying around the Lab!