It's no fun when you're paged at 2:00 A.M. because your production server unexpectedly rebooted due to a bug check, a Windows system crash that can be caused by any of a number of conditions, including malfunctioning hardware. That's exactly what happened to one server administrator. The admin contacted Microsoft, and we analyzed the dump file. We found evidence of a hardware problem in the form of a bit flip. A bit flip occurs when data is being copied and one of the bits changes to an incorrect value: a 1 incorrectly becomes a 0, or vice versa. Bit flips that lead to bug checks are a common way that Windows detects a hardware problem (e.g., bad memory, an overheating CPU).
In this article, I'll explain what a bit flip is and walk through an example of how we found one: a bit changed to an incorrect value as the CPU attempted to copy data, and the result was a system crash. That way, if Microsoft support reviews your memory dump, and the support engineer explains that we found evidence of a hardware problem in the form of a bit flip, you'll have a solid understanding of what the engineer is talking about. I'll also provide some background information about access violations, registers, and the mov assembly language instruction.
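To make the idea of a bit flip concrete, here is a minimal Python sketch. The helper name flip_bit, the example value 0x12345678, and the bit position are all made up for illustration; they are not taken from the customer's dump. The sketch only shows that a corrupted value can differ from the original by exactly one bit.

```python
def flip_bit(value: int, bit: int, width: int = 32) -> int:
    """Return `value` with a single bit inverted, kept within `width` bits."""
    mask = (1 << width) - 1
    return (value ^ (1 << bit)) & mask


original = 0x12345678                # arbitrary example value, not from the dump
corrupted = flip_bit(original, 3)    # hypothetical: bit 3 changes from 1 to 0

print(f"{original:#010x}")           # 0x12345678
print(f"{corrupted:#010x}")          # 0x12345670

# XOR leaves a 1 wherever the two values disagree; a single bit flip leaves one.
print(bin(original ^ corrupted).count("1"))   # 1
```

Comparing a suspect value with the value it should have been, and finding that they differ in only one bit, is the kind of pattern that suggests a bit flip rather than a software bug.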
Access Violation
The customer's server generated a memory.dmp file, which the customer submitted to the Microsoft Global Escalation Services team for analysis. I loaded the crash dump into the Windows debugger and began my review. (For more information about how to load a dump file into the debugger on your system, see "Administrators' Intro to Debugging.")
Once the dump file was loaded into the debugger, I ran the command
!analyze -v
which provides basic information about the type of crash that occurred. The textual output from !analyze -v explained that invalid memory was referenced. The debugger also displayed the instruction that the CPU was attempting to execute when the crash occurred. This type of crash usually occurs when a pointer is set to a value that it should never hold. A pointer should hold the address where data is located in memory. If a pointer is set to a bad value, the system can crash while attempting to follow it. When this type of crash occurs as a result of the system accessing a garbage address, the crash is commonly referred to as an access violation.
The !analyze -v output also lists the assembly language instruction that caused the access violation. In the output that Figure 1 shows, 80546944 was not a valid address, as indicated by the question marks shown next to the address. When the code that was running on the CPU tried to access this address, a page fault trap occurred, followed by an access violation.
Figure 1: !analyze -v output showing access violation
PAGE_FAULT_IN_NONPAGED_AREA (50)
Invalid system memory was referenced. This cannot be protected by try-except,
it must be protected by a Probe. Typically the address is just plain bad or it
is pointing at freed memory.
Arguments:
Arg1: 80546944, memory referenced.
mov eax, dword ptr [esi] 80546944=????????
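The access violation described above can be modeled with a toy example. In this Python sketch, the memory dictionary, the dereference helper, and the AccessViolation exception are hypothetical stand-ins, not anything Windows or the debugger provides. The only value taken from the article is the bad address 80546944 from Figure 1.

```python
class AccessViolation(Exception):
    """Raised when the toy memory model is asked for an unmapped address."""


# Hypothetical toy memory: address -> 32-bit value.
# Only the addresses listed here count as valid.
memory = {
    0x1000: 0xDEADBEEF,
    0x1004: 0x00000042,
}


def dereference(pointer: int) -> int:
    """Follow a pointer: return the data stored at the address it holds."""
    if pointer not in memory:
        raise AccessViolation(f"invalid memory referenced: {pointer:08x}")
    return memory[pointer]


good_pointer = 0x1000
bad_pointer = 0x80546944   # the address from Figure 1; unmapped in this toy model

print(hex(dereference(good_pointer)))

try:
    dereference(bad_pointer)
except AccessViolation as exc:
    print(exc)
```

In user-mode code, an error like this can often be caught and handled. As the Figure 1 text notes, a bad reference to system memory cannot be protected by try-except, which is why the real system bug checked instead.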
Introducing the mov Command
Notice the mov command (which stands for move) in the output in Figure 1. Executing the mov command on the CPU copies the source to the destination. I'm not sure why this command wasn't called copy instead of mov, since the command doesn't delete the data from the source.
The mov command needs to know where to copy the data from and where to copy it to. This information is provided in the form of operands. In the mov instruction in Figure 1, the EAX register is the first operand, and the first operand is the destination. (I'll explain what registers are shortly.) The second operand, the source, is dword ptr [esi]. This represents the data at the address held in the ESI register. How do we know that it is the data at the address pointed to by ESI and not the ESI register itself? Because the debugger has surrounded the ESI register in brackets. The brackets tell us that the processor was not using the register itself but was instead using the contents of the register as a pointer to the virtual address where the data is located.
The debugger also displays dword ptr, which tells us that the ESI register is treated as a pointer to a dword, that is, 32 bits (4 bytes) of data. Using a pointer as an address to get the actual data is called dereferencing the pointer. To summarize, the debugger has helped us identify that the command that was executing on the CPU when the crash occurred was trying to copy a dword of data from the address contained in the ESI register into the EAX register. So there are registers with names, but what is a register anyway?
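As a rough illustration of what mov eax, dword ptr [esi] does, the following Python sketch models a CPU with two registers and a small block of memory. The ToyCpu class, its method names, the base address 0x1000, and the stored bytes are all assumptions made for this example. It shows the difference between using a register's value directly and using it as a pointer, and that mov copies data without disturbing the source.

```python
import struct


class ToyCpu:
    """A toy model: two named registers plus a small block of memory."""

    def __init__(self, memory: bytes, base: int):
        self.memory = memory    # raw bytes standing in for RAM
        self.base = base        # address of the first byte in `memory`
        self.regs = {"eax": 0, "esi": 0}

    def read_dword(self, address: int) -> int:
        """Read 4 bytes at `address` as a little-endian 32-bit value."""
        offset = address - self.base
        if offset < 0 or offset + 4 > len(self.memory):
            raise MemoryError(f"invalid address {address:08x}")
        return struct.unpack_from("<I", self.memory, offset)[0]

    def mov_reg_reg(self, dst: str, src: str) -> None:
        """Model of `mov dst, src`: copy the register value itself."""
        self.regs[dst] = self.regs[src]

    def mov_reg_mem(self, dst: str, ptr_reg: str) -> None:
        """Model of `mov dst, dword ptr [ptr_reg]`: dereference, then copy."""
        self.regs[dst] = self.read_dword(self.regs[ptr_reg])


cpu = ToyCpu(bytes.fromhex("78563412"), base=0x1000)
cpu.regs["esi"] = 0x1000            # ESI holds an address, not the data

cpu.mov_reg_reg("eax", "esi")       # no brackets: EAX gets the address itself
print(hex(cpu.regs["eax"]))         # 0x1000

cpu.mov_reg_mem("eax", "esi")       # brackets: EAX gets the dword at that address
print(hex(cpu.regs["eax"]))         # 0x12345678
```

After either call, ESI and the bytes in memory are unchanged, which is why copy would arguably be a more accurate name than mov.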