Microsoft provides a handy little tool with
Exchange Server 4.0 called LoadSim (as seen in Screen A), which functions as a
load generator and user simulator for capacity testing a messaging
platform, specifically Exchange. It runs on one or more client machines in
tandem, sending and receiving messages, accessing public or private folders,
etc., as it emulates the activities of a normal Exchange user.
While LoadSim was intended to be a capacity
planning tool (to find out how many users you can support on a system, with what
kind of response times), it also makes an excellent performance testing tool if
used properly. However, LoadSim is not without problems: client dependencies, a
quirky user interface, and sometimes unpredictable behavior. If you are aware of
them, you can plan your testing strategy around these holes and use LoadSim to
test existing systems, or to find out what a new one will do for you.
In the Windows NT Magazine Lab, we decided that LoadSim would make an
excellent first step in testing server hardware as messaging platforms: we can
tune the system configuration (number of CPUs, amount of memory, disk and
network layouts, etc.) and change the user load (number of users, transaction
mix) to come up with curves that tell a more complete story about a particular
machine. Instead of a single number characterizing the performance of an entire
client/server system, we can use these curves to find trends and breakpoints of
various types of systems.
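One way to read a breakpoint off such a curve is sketched below. This is an illustration only, not part of LoadSim: the function name find_breakpoint, the 15% threshold, and the sample numbers are all assumptions made up for the example, not Lab measurements.

```python
def find_breakpoint(curve, threshold=0.15):
    """Return the first user count whose response time exceeds the
    lightest-load response time by more than `threshold` (a fraction).

    `curve` is a list of (users, response_time_ms) pairs; returns None
    if the curve never crosses the threshold.
    """
    points = sorted(curve)       # order by user count
    baseline = points[0][1]      # response time at the lightest load
    for users, response_ms in points[1:]:
        if response_ms > baseline * (1 + threshold):
            return users
    return None


# Hypothetical data for illustration only (not measured results).
example_curve = [(100, 200.0), (200, 210.0), (300, 225.0), (400, 320.0)]
print(find_breakpoint(example_curve))
```

Running the same function over curves from different server configurations would give one breakpoint per configuration, which is easier to compare than a single response-time figure.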
Know Your Enemy
First, let's look at the problems we know about.
Client dependencies in LoadSim are fairly significant: the horsepower of the
client system has a large bearing on measured response times. LoadSim is more
memory constrained than CPU constrained, but even with a large amount of memory,
the client falls down at high user counts. Besides, you have to keep the test
realistic: you can't simulate 1000 users on a single physical client system,
because doing so introduces dependencies at the client level that you are trying
to avoid. In fact, it introduces at the client the very dependencies you are
trying to measure on the server! With too high a user count, whether or not the
CPU is fully taxed and memory is optimal, the I/O capabilities of the client
system get in the way. With an appropriately fat client, you can simulate a
certain number of users and attain the same throughput for each one (within an
acceptable tolerance) as you would with a separate physical machine for each
client. If you go too far, you hit bottlenecks in the client, such as network
bandwidth, memory, CPU, and disk utilization, that warp your results.
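The "same throughput for each user, within an acceptable tolerance" check can be written down as a short sketch. The function name, the 10% tolerance, and the figures are assumptions for illustration; LoadSim itself does not provide this check.

```python
def client_is_overloaded(reference_per_user, total_throughput, users,
                         tolerance=0.10):
    """True if per-user throughput on a shared client falls more than
    `tolerance` (a fraction) below the one-machine-per-user reference."""
    per_user = total_throughput / users
    return per_user < reference_per_user * (1 - tolerance)


# Hypothetical numbers (messages per hour), not measured results.
print(client_is_overloaded(12.0, 1150.0, 100))  # 11.5 per simulated user
print(client_is_overloaded(12.0, 1500.0, 200))  # 7.5 per simulated user
```

If the check reports an overloaded client, the measured response times say more about the client machine than about the server under test.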
When we set up our testing environment for the
Tricord review, we ran tests using a maximum configuration on the server (four
CPUs, 1GB of RAM), while varying the number of users simulated on a single
physical client system. We found that the response time didn't start
degrading noticeably until we went above 100 users (that is, the response
time at 10 users was within 10%-15% of that for 100 users). Also, other vendors
such as Compaq, and even Microsoft, have performed similar tests in a comparable
environment to the one we used, and came up with the same results for client
load. We also tuned the user load and think times (how long the pause is between
user operations) to values between an absolute "real world" day (eight hours,
with long breaks between actions) and a livable testing environment that
wouldn't take 24 hours to get a single data point. We ended up with a two-hour
simulated day and a four-hour test run (two simulated days), which neither
overwhelmed the client system nor represented an unrealistic environment. We
took data points from the two middle hours (the last half of the first day and
the first half of the second day), so that neither the ramp-up time (the first
hour) for the test to reach steady state nor the ramp-down as the users log off
influenced the results.
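The middle-two-hours rule amounts to a simple filter on the logged samples. The sketch below assumes each sample is a (minutes since start, response time) pair; that layout, the function names, and the sample values are hypothetical and do not reflect LoadSim's actual log format.

```python
def steady_state_samples(samples, run_minutes=240, ramp_up=60, ramp_down=60):
    """Keep samples taken after ramp-up and before ramp-down.

    `samples` is a list of (minutes_since_start, response_time_ms) pairs.
    """
    start = ramp_up
    end = run_minutes - ramp_down
    return [(t, rt) for t, rt in samples if start <= t < end]


def mean_response(samples):
    """Average response time over the given samples, or None if empty."""
    if not samples:
        return None
    return sum(rt for _, rt in samples) / len(samples)


# Hypothetical samples for illustration only.
run = [(10, 900.0), (70, 300.0), (120, 320.0), (170, 310.0), (230, 150.0)]
window = steady_state_samples(run)
print(window)
print(mean_response(window))
```

The defaults mirror the four-hour run described above: the first and last hours are discarded and only the two middle hours contribute to the average.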
Since we could operate within a reasonable
range of real world results, and keep the test believable and repeatable, we
determined that LoadSim was a good starting point for messaging tests. But what
about the other problems I mentioned, like the inconsistent interface and
unpredictability, which would seem to argue against using this tool at all?
The interface is a resolvable issue: it just
takes a little babysitting of the test runs. The utility itself follows the
typical Microsoft GUI guidelines (rather than being a command-line interface),
but the error trapping is a little weak, so restarting the tool or reloading a
set of test parameters can change test settings. Before each run, we had to
double-check every system to make sure that it was going to run the test we
intended.
Unpredictability is a little more difficult to
deal with, and it is a twofold problem. First is the unpredictability of the
interface, which I just explained. Second is the unpredictability of the test
results. On the one hand, LoadSim is a fine end-to-end testing environment; on
the other hand, you don't really know what it is measuring, and can only infer
certain things by analyzing the results against server operations (such as disk
and CPU utilization). Only a narrow band of test settings, combined with a
specific hardware configuration on the server, seems to give relatively
error-free logs (see the section on load and scalability in the main article).
A test run isn't necessarily invalid if there are errors; they just point to
bottlenecks in the system.