Getting Results
Now that you know how to execute the kernel debugger, let's put this knowledge to work with some basic commands in a sample troubleshooting scenario. Imagine that your company installed some new applications and hardware upgrades on one of your department's NT servers 2 months ago. Since then, the following cycle has repeated several times: The server runs properly for several days, but then users connecting to resources start to complain about slower and slower performance. Each time, the server eventually stops responding altogether and a blue screen occurs. You’ve been coming in on weekends to reboot the server and head off the problem, but you haven’t had time to investigate further. You have now configured the machine to generate a crash dump, so the next time the blue screen occurs, the OS writes a memory.dmp file to disk. After you restart the server, you can use the kernel debugger to gather information about what happened, so you copy the memory.dmp file to the debugging workstation where you've already created a symbol tree.
After the kernel debugger loads your crash dump and is ready to accept commands, the first step is to verify that the symbol tree you created on your debug system is accurate. To verify the accuracy, type the command
!locks -p
at the kd> prompt. This command dumps kernel-mode resource locks and resource performance data, if it exists. In the process of performing this dump, the command also verifies the symbol file for every loaded module. You notice that all the symbol files are found and loaded properly, with no checksum errors.
Remembering that users were complaining of slower and slower performance, you decide to look at the virtual memory statistics at the time of the crash. One thing that can affect server performance is one process using large amounts of physical memory that previously belonged to other processes or the OS. To see if this is the case, you type in the command
!vm
at the kd> prompt. This command displays information about the memory in use by the system (Paged Pool and NonPaged Pool) and by each user-mode process, as Screen 4 shows. This information is helpful for identifying processes that may have been leaking memory for long periods. When you execute the command, it shows that a user-mode process called database.exe has been using 24MB of memory on this 64MB system.
Suspecting that maybe a 64MB system is insufficiently powered to serve as a database server, you wonder what other upgrades your company performed on this system. You type in
!drivers
at the kd> prompt to show base addresses, sizes, and link dates for all kernel-mode components, as Screen 5 shows. You can look in the output of this command for recently added drivers or applications. Upon inspection, this output lists a new driver that you haven’t seen before—tapedev.sys—along with its base load address, F8C00000. By now, the information from the initialization of the kernel debugger has scrolled off the top of the screen, so you redisplay the blue screen stop codes by typing
dd kibugcheckdata l5
at the kd> prompt. The dd command displays the contents of memory as double words (32-bit values), the l5 argument limits the display to five of them, and kibugcheckdata is a symbol name that points to the location of the blue screen stop codes in memory. The stop code was 1E (KMODE_EXCEPTION_NOT_HANDLED), and one of its parameters (F8C01482) looks suspiciously like a memory address that might be inside the driver tapedev.sys, which is loaded at F8C00000. This clue points to the recently installed hardware upgrades as a possible source of the problem.
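The reasoning behind that suspicion is a simple range check: an address belongs to a driver if it falls between the driver's base address and the base plus the image size. The following Python sketch illustrates the idea. The name tapedev.sys, its base address, and the suspect address come from the scenario; the image sizes and the second driver entry are hypothetical placeholders, because in practice you would read those values from the !drivers output.

```python
from typing import List, NamedTuple, Optional


class Module(NamedTuple):
    name: str
    base: int  # load address, as listed by !drivers
    size: int  # image size in bytes (hypothetical values below)


def find_module(address: int, modules: List[Module]) -> Optional[Module]:
    """Return the module whose [base, base + size) range contains address."""
    for mod in modules:
        if mod.base <= address < mod.base + mod.size:
            return mod
    return None


# tapedev.sys and its base address come from the scenario. The sizes and
# the example.sys entry are made-up placeholders for illustration only.
modules = [
    Module("example.sys", 0xF8B00000, 0x8000),
    Module("tapedev.sys", 0xF8C00000, 0x4000),
]

suspect = 0xF8C01482  # stop-code parameter shown by dd kibugcheckdata
hit = find_module(suspect, modules)
if hit is not None:
    # The offset into the image is the address minus the base address.
    print(f"{suspect:08X} is {hit.name}+0x{suspect - hit.base:X}")
else:
    print(f"{suspect:08X} is not inside any listed module")
```

With these assumed sizes, the address resolves to offset 0x1482 inside tapedev.sys. If the driver's real image were smaller than that offset, the address would lie outside it, so check the size that !drivers reports before drawing a conclusion.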
Next, you decide to look at what other software was running on the system. You type
!process
at the kd> prompt to display the process that owned the thread that was executing at the time of the crash, as Screen 6 shows. This process is often a good clue as to what triggered the blue screen. Running the !process command shows that the database.exe process was active. Taking it a step further, you use the !process 0 0 command to display an abbreviated list of all processes that were in memory, as Screen 7 shows. Scrolling back through the output, you see that almost 70 processes were running, including some of the newly installed applications, which is yet another sign that the server is underpowered. The !process 0 7 command displays expanded information about each process, including all its threads, what routines each thread was executing, and the total CPU time used. Armed with this information, you contact the vendors of the database application and tape driver software that your company installed on the server and politely request that they perform some thorough compatibility testing.
Certainly, most problems and debugging sessions are not this simple, and usually a good knowledge of NT internals is a must to drill down and get the exact evidence you need. However, you can discover some information about the crash that normally would have been lost. And fortunately, Microsoft has written a utility to automate some of the process. Dumpexam.exe (also located in C:\symtree\basent4\system32) executes the above debugger commands and several others against a specified crash dump file. Like i386kd.exe, it requires the symbol tree. Execute it with the following command line:
DUMPEXAM -Y <symbolpath> <crashdumpfile>
Replace symbolpath with the value of your _NT_Symbol_Path variable from the debug.cmd command file, and replace crashdumpfile with the full path to the crash dump file. Dumpexam.exe will write the output to a file, memory.txt.
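If you analyze crash dumps regularly, you might wrap the Dumpexam command line in a small script. The following Python sketch only assembles and launches the command shown above. The function names and the example paths are hypothetical placeholders, and the script assumes that dumpexam.exe can be found on the PATH of the debugging workstation.

```python
import subprocess
from typing import List


def dumpexam_command(symbol_path: str, dump_file: str) -> List[str]:
    """Build the argument list for: DUMPEXAM -Y <symbolpath> <crashdumpfile>."""
    return ["dumpexam", "-Y", symbol_path, dump_file]


def run_dumpexam(symbol_path: str, dump_file: str) -> int:
    """Run Dumpexam and return its exit code.

    Assumes dumpexam.exe is on the PATH. What a particular exit code
    means is not covered here, so treat the value as informational.
    """
    return subprocess.run(dumpexam_command(symbol_path, dump_file)).returncode


# Placeholder paths: substitute your own symbol path and crash dump file.
cmd = dumpexam_command(r"C:\symtree", r"C:\dumps\memory.dmp")
print(" ".join(cmd))
```

Passing the arguments as a list rather than as one string avoids quoting problems when a path contains spaces. Call run_dumpexam only on a machine that has Dumpexam and the symbol tree installed.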
Wrapping Up
NT kernel debugging is a huge topic, but you don’t have to know every last detail to make use of the technology and accelerate your troubleshooting efforts. As this article demonstrates, simply knowing how to set up the kernel debugger and extract information about the status of a failed system can be of great benefit to you and to your hardware and software vendors.