Troubleshooting your crashed NT systems
Both experienced support professionals and new Windows NT users commonly regard Microsoft's command-line x86 kernel debugger as a black art. When a workstation or server fails and displays a blue screen, it can generate a crash dump. Unfortunately, many users ignore or delete these crash dumps. However, with some basic preparation and knowledge, you can use the kernel debugger to extract valuable information about the system's state at the time of failure. You can then correlate this information with the installed hardware, software, and other system parameters to formulate a troubleshooting strategy.
Although a full treatment of the kernel debugger and debugging tactics could fill several books, setting up the kernel debugger to debug a crash dump is not difficult. This article explains the process step by step and presents several specific command examples that demonstrate how you can use the debugger to extract useful information from crash dumps. I also provide references to existing literature on kernel debugging for further research.
Kernel Debugging Basics
Debuggers let you inspect and troubleshoot program code as it runs. You can examine variables, registers, and stacks, and pinpoint problems by stepping through a program line by line. Some debuggers support source-level debugging by matching the developer's source code (written in C, Basic, or another high-level language) with the corresponding machine instructions. This level of detail shows how the compiler translated each line of source code and the exact effect of that code on the system. Other debuggers support only machine-instruction or assembly-language debugging.
You typically use kernel debuggers to debug core OS components and drivers, and user-mode debuggers to debug applications and services. In a live debug session, a serial cable connects the target machine that you want to debug to a host machine that runs the debugger. Debug code running on both machines exchanges commands and data over the serial ports. In a crash dump debug session, you work offline, after the failure has occurred, and analyze the crash dump file, which represents the complete contents of memory at the time of the crash.
Several debuggers are available for NT from Microsoft and from third-party software vendors such as Compuware NuMega. Two well-known Microsoft debuggers for the x86 platform are i386kd.exe and windbg.exe. I386kd.exe (available in the \support\debug\i386 directory of the NT 4.0 CD-ROM) is the command-line kernel debugger for x86 code. Windbg.exe (available separately from Microsoft) is the GUI counterpart of i386kd.exe and can perform both kernel-mode and user-mode debugging. Each debugger executable interprets the register, stack, and instruction information for a particular processor architecture. For Alpha code, alphakd.exe is the equivalent of i386kd.exe. (In this article, the term kernel debugger means the x86 i386kd.exe from NT 4.0.)
Microsoft and third-party software vendors sometimes request that customers submit crash dumps as compressed files for diagnostic purposes. In some instances, vendors request permission to dial in to a customer's site and engage in a live debug session. The vendors typically perform these sessions using i386kd.exe because they can easily pipe this command-line tool's input and output through the remote.exe utility (available in the Microsoft Windows NT Server 4.0 Resource Kit) and thereby work on the failed system remotely. Even if you never debug your crash dumps yourself, setting up the symbolic information ahead of time will speed up the debugging process.
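As a rough illustration of preparing a dump for submission, the following Python sketch compresses a dump file before you send it. The gzip format, the function name compress_dump, and the paths are assumptions made for this example only; the vendor may require a different archive format, so confirm that before sending anything.

```python
import gzip
import shutil
from pathlib import Path


def compress_dump(dump_path, out_dir):
    """Gzip-compress a crash dump into out_dir and return the new path.

    Hypothetical helper: the gzip format is an assumption, not a vendor
    requirement stated in this article.
    """
    src = Path(dump_path)
    dst = Path(out_dir) / (src.name + ".gz")
    # Stream in 1 MB chunks so a dump as large as physical RAM
    # never has to fit in memory at once.
    with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out, length=1024 * 1024)
    return dst
```

For example, compress_dump(r"C:\WINNT\memory.dmp", r"D:\dumps") would write D:\dumps\memory.dmp.gz, assuming both of those illustrative paths exist on your system.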
Blue Screens and Crash Dumps
The blue screen of death is something every experienced NT support professional has seen. The sidebar, "Windows NT Kernel Debugging Resources," lists resources that explain why blue screens happen and how to interpret them. As a refresher, the blue screen indicates that the OS encountered an abnormal situation that it couldn't handle through its normal error mechanisms and consequently decided that it couldn't guarantee continued safe processing. Rather than risk corrupting data, the OS and device drivers call a special internal function, KeBugCheckEx(), to halt the system. After taking control of the system and placing the display into VGA 80x50 mode, this function generates all the information seen on the blue screen, such as the stop code and its parameters, driver addresses, and stack data.
The function also generates a crash dump, but only if you select the Write debugging information option on the Startup/Shutdown tab of the System applet in the NT 4.0 Control Panel, as Screen 1 shows. Assuming you have properly sized the paging file, the OS writes the contents of memory into the paging file and marks the location with a special code. When the write completes, the blue screen displays a message to this effect, and you can restart the system to restore operation. Upon reboot, the savedump.exe utility copies this part of the paging file to the specified file, usually %systemroot%\memory.dmp.
You can then access the crash dump. I suggest you move the memory.dmp file from the crashed system to removable storage or another location on the network.
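One way to make that habit routine is a small script that checks the file and copies it to an archive location under a unique name. The Python sketch below is only an illustration: the function names, the machine-name and timestamp naming scheme, and the paths are hypothetical, and the check assumes that 32-bit NT dump files begin with the ASCII bytes PAGEDUMP, which you should verify against your own dumps.

```python
import shutil
import time
from pathlib import Path

# Assumption: a 32-bit NT crash dump starts with these eight ASCII bytes.
DUMP_SIGNATURE = b"PAGEDUMP"


def looks_like_crash_dump(path):
    """Return True if the file starts with the assumed dump signature."""
    with open(path, "rb") as f:
        return f.read(len(DUMP_SIGNATURE)) == DUMP_SIGNATURE


def archive_dump(dump_path, archive_dir, machine_name):
    """Copy a dump to archive_dir as <machine>-<timestamp>.dmp.

    Hypothetical helper. It copies rather than moves, so the original
    stays in place until you have confirmed the archived copy.
    """
    src = Path(dump_path)
    if not looks_like_crash_dump(src):
        raise ValueError(f"{src} does not start with the expected signature")
    # Name the copy after the dump's last-modified time so dumps from
    # later crashes do not overwrite earlier ones in the archive.
    stamp = time.strftime("%Y%m%d-%H%M%S", time.localtime(src.stat().st_mtime))
    dest_dir = Path(archive_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / f"{machine_name}-{stamp}.dmp"
    shutil.copy2(src, dest)
    return dest
```

A call such as archive_dump(r"C:\WINNT\memory.dmp", r"\\fileserver\dumps", "SERVER01") would then leave a uniquely named copy on the network share; the share and machine name here are placeholders.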