Monitoring your network involves a lot more than just keeping tabs on the health of your servers. It’s also important to determine whether your websites, applications, network infrastructure, and servers are functioning 24 x 7. Using a single dashboard for network monitoring makes the task easier.
Before you start monitoring, it’s useful to know what you’re looking for in a healthy server. Server performance is typically assessed using four Key Performance Indicators (KPIs): Processor, Memory, Disk, and Network. We can create a health model for a server that incorporates these components, as well as other factors. For example, is a server healthy when it isn’t running? If it’s configured incorrectly? If its security is compromised? The more conditions we add to our definition of a healthy server, the more useful our health model will be in assessing our servers’ health. A server’s health model is sort of like a painting of what a server should look like—we start with a rough sketch of a server, adding details that help the sketch evolve into a full-color painting of the server.
Using the health model approach lets us provide monitoring not only for servers but also for custom applications, websites, network devices, and many other important aspects of a business. In Microsoft System Center Operations Manager 2007 R2, the server health model focuses on four main areas: availability, configuration, performance, and security. Within that model, the Processor, Memory, Disk, and Network KPIs directly determine how well a server is performing. The Windows Server Operating System Management Pack for Operations Manager 2007 includes the server health model. One way to display Operations Manager’s health model is the Health Explorer interface, which Figure 1 shows. This health model is extremely detailed; for the purposes of this article, let’s focus on how Operations Manager integrates the various KPIs.
Processor KPI
Typically, a processor bottleneck is defined as more than 80 percent processor utilization for a period of time. Unfortunately, this type of bottleneck occurs relatively frequently and can generate a significant number of alerts that the server administrator can’t act on. The Operations Manager monitor (Total CPU Utilization Percentage) takes processor monitoring a step further by alerting only when multiple conditions occur together. Health states for this monitor are either healthy or critical based on the following conditions:
- Critical state occurs when processor utilization (Processor\% Processor Time\_Total) is higher than 95 percent for 6 minutes (after three samples on a 2-minute schedule) and when processor queue length (System\Processor Queue Length) is greater than 15 for 4 minutes (after two samples on a 2-minute schedule).
- For all monitors discussed in this article, when the measured value drops back below the threshold defined for the critical or warning state, the monitor resets itself to a healthy state.
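To make the two-condition logic concrete, here is a minimal Python sketch of one way to read the conditions above. It is not Operations Manager's implementation. It assumes that "three samples" and "two samples" mean consecutive samples that each exceed the threshold, and the names `Sample` and `cpu_monitor_state` are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One reading taken on the 2-minute schedule (hypothetical structure)."""
    cpu_percent: float   # Processor\% Processor Time\_Total
    queue_length: float  # System\Processor Queue Length


def cpu_monitor_state(samples, cpu_threshold=95.0, cpu_samples=3,
                      queue_threshold=15.0, queue_samples=2):
    """Return 'critical' or 'healthy' for samples ordered oldest to newest.

    Assumption: a condition is met only when every one of the most recent
    N samples is above its threshold.
    """
    if len(samples) < max(cpu_samples, queue_samples):
        return "healthy"  # not enough history to confirm a bottleneck

    cpu_breached = all(s.cpu_percent > cpu_threshold
                       for s in samples[-cpu_samples:])
    queue_breached = all(s.queue_length > queue_threshold
                         for s in samples[-queue_samples:])

    # Both conditions must hold together; otherwise the state is healthy,
    # which also models the automatic reset once values drop again.
    return "critical" if cpu_breached and queue_breached else "healthy"


spike = [Sample(50, 2), Sample(99, 20), Sample(60, 3)]
sustained = [Sample(97, 18), Sample(98, 20), Sample(99, 22)]
print(cpu_monitor_state(spike))
print(cpu_monitor_state(sustained))
```

In this sketch a single high reading leaves the monitor healthy, while three consecutive high utilization readings combined with a long processor queue move it to critical.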
This approach minimizes the amount of noise (i.e., nonactionable alerts) by alerting only when the condition is likely to represent a real bottleneck on a server rather than a temporary spike in processor utilization. Although this approach provides a good starting point for most servers, not all servers are created alike. Some servers consistently experience higher processor workloads (e.g., servers running SQL Server or Exchange Server).
Virtualized servers also often experience higher than average processor interrupt levels that require tuning within Operations Manager. This doesn’t indicate that the virtualized guest OS has additional overhead but rather that this particular counter might not be as relevant (or might have a higher than average value) in a virtualized guest OS.
Operations Manager lets you use overrides to tune alerts so that different systems or groups of systems use different thresholds. An override changes the default behavior of a rule or monitor for the systems to which the override is applied. For example, suppose you have a computer group that contains all virtual servers. (For information about detecting both VMware and Hyper-V servers, see “Virtual Machine Discovery MP for Operations Manager 2007.”) You can target an override to change the thresholds to either a higher or lower level for that group. For processor counters, it’s common to create an override that lowers the thresholds for systems on which processor bottlenecks are likely to occur or that raises the thresholds for the average processor interrupt level.
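The idea of an override can be sketched in a few lines of Python: start from default thresholds and let a group-targeted override replace individual values. This is a simplified illustration, not how Operations Manager stores or resolves overrides. The group names and override values are hypothetical; only the defaults of 95 percent and a queue length of 15 come from the monitor described earlier.

```python
# Defaults taken from the Total CPU Utilization Percentage monitor described above.
DEFAULT_THRESHOLDS = {"cpu_percent": 95.0, "queue_length": 15.0}

# Hypothetical groups and values, chosen only to show both directions
# (raising a threshold for one group, lowering it for another).
GROUP_OVERRIDES = {
    "Virtual Servers": {"queue_length": 25.0},
    "Bottleneck-Prone Servers": {"cpu_percent": 90.0},
}


def effective_thresholds(groups, defaults=DEFAULT_THRESHOLDS,
                         overrides=GROUP_OVERRIDES):
    """Return the thresholds that apply to a server in the given groups.

    Assumption: overrides are applied in the order the groups are listed,
    so a later group wins if two groups override the same threshold.
    """
    result = dict(defaults)  # copy so the defaults are never modified
    for group in groups:
        result.update(overrides.get(group, {}))
    return result


print(effective_thresholds([]))
print(effective_thresholds(["Virtual Servers"]))
```

A server that belongs to no overridden group keeps the defaults, while a member of the hypothetical "Virtual Servers" group gets the raised queue-length threshold and the unchanged utilization threshold.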
Operations Manager doesn’t limit the health model for the processor to the Total CPU Utilization monitor. Instead, this monitor is supplemented with the Total Percentage Interrupt Time and the Total DPC Time Percentage counters. These counters can also indicate performance bottlenecks on the processor for a server.