By Kyle Wilson
Saturday, May 08, 2004
If programmers are a cynical bunch, it's because we are reminded, constantly, of human fallibility. Our lives are a cycle of grand idea and skilled implementation followed by failure to build, failure to run, failure to work. We're smart people, and we never get it right the first time. We spend less time creating than we do finding and fixing the things we've done wrong. We cope with that by becoming either laid back and cynical or neurotic and cynical.
We had an interesting bug crop up the other day in MechAssault 2. When it was reported to me, I was told that the game crashed, without hitting any asserts, when a player went to the multiplayer menu and launched a one-person multiplayer mission. The crash was perfectly reproducible, which is always nice. So I said I'd have a look at it in the debugger, expecting to find a pointer dereference that wasn't checking for null or something similarly obvious.
MA2 is an Xbox title. Like most console games, MA2 does its own memory management instead of just relying on the standard-library-provided new and malloc functions. Our memory allocator does bookkeeping, categorizing allocations in different heaps and providing leak-tracking functionality. A header struct precedes each block of allocated memory, and the allocation headers point to one another in a doubly-linked list. That way we can walk the list on exit to identify leaks.
When an allocated block of memory is freed, it's removed from the doubly-linked list. The previous block is set to point to the next block and the next block is set to point to the previous block. This is where the game was crashing. The memory pointed to by the next pointer wasn't a valid memory header. It looked like total garbage.
The memory allocation being freed was an allocation by a std::string, which was a member variable of a menu component being deleted. I examined the menu component, its destructor, and the destructors of the classes from which it derived, trying to figure out if any of them were doing something untoward which could be overwriting random memory. But it didn't look like the menu component's destructors were doing anything very interesting at all. Another programmer suggested that maybe we were attempting to free a block of memory which had already been freed. But it seemed like that should be trashing the current memory header, not the next one.
Reproducibility is the key to bug eradication, and the more perfectly reproducible a problem is, the easier it is to track and kill. Since I was dealing with a crash in our memory allocation system, and memory locations would change from run to run, it would be impossible for me to backtrack from location to location on successive runs and thereby identify the point at which things first began going out of control. So my first step was to try to achieve perfect reproducibility. I made an input recording up to the point of the crash.
A game is simply a piece of lockstep machinery with all variation driven by certain inputs. These are clock time, controller inputs, and network messages. If you log all of these, you should be able to reproduce a recorded game exactly. If you flush the output file every frame while you record, you should be able to perfectly reproduce a crash. I played back my recording. The game ran to the end of the recording, at which point it neither crashed nor launched a mission, but simply continued to sit at the final menu. I picked up the controller and launched the mission which had crashed before. Now, after playing the input recording up to that point, it did not crash. The mission launched and ran.
This was very annoying.
Okay. I knew that I could reproduce the crash reliably if I didn't play an input recording. And I knew that it always crashed deleting a menu component. I decided that if I could observe the last moments of life of that menu component, maybe I could figure out when our memory headers were getting corrupted. So I added an OutputDebugString call with a counter in the component shutdown loop. I made a note of the last component number displayed before the game crashed. I ran again with a conditional breakpoint set to break at that component number. The breakpoint was hit. I hit continue to make sure the crash would still occur. It didn't.
Fine. I replaced the conditional breakpoint in the debugger with an assembly-language int 3 breakpoint in code. I ran, hit the breakpoint, continued. The game crashed. Excellent. I ran, hit the breakpoint, stepped line-by-line over the deletion of the component... and the game didn't crash. The component was deleted. The mission loaded.
I began to realize that this wasn't going to be a stand-up fight. It was as if the bug hid itself whenever I got close.
I decided to backtrack. When I was able to get the crash to occur, it was always because of a bad pointer in a memory allocation header. Headers were all stored in a doubly-linked list. If any link in that list went bad, I should be able to detect it by walking the entire list and verifying at each block that the previous block's next pointer and the next block's previous pointer both pointed to the current block. I modified an existing allocation counting function to do this. It seemed like the bad header was being corrupted somewhere in menu shutdown, since that was the where the game kept crashing. I added calls to my allocation-list verification function in the menu component shutdown loop before and after each component was deleted. If a component's destructor was corrupting memory, I'd know which one immediately.
Instead, the list-checker went off the first time it was called, on a header at address 0x00cb100 whose next block's previous pointer was 0x00cb101.
This seemed a little suspicious. All of our memory allocations should have been 4-byte-aligned, which 0x00cb101 clearly wasn't. But 0x00cb101 didn't really seem like totally trashed memory, either. It was only greater than the value it should have had by one. Was this some incredibly rare special-case that the memory allocator was setting incorrectly for some reason? Had someone tried to squeeze extra data into the memory allocation header by using the unused lowest bit of one of the pointers, but forgotten to mask it out before it was used?
I added calls to my list checker function at the beginning and end of our alloc and free functions, and ran again. The game ran very, very slowly. But when my list-checker triggered again, it was at the beginning of an alloc call. The memory allocator was exonerated. Something else had corrupted the list sometime since the last allocation or deletion.
I looked higher up the stack. A particular menu component was being processed. The menu hadn't even begun to shut down yet. I removed my list-checking calls from the memory allocator and seeded them throughout that menu component's process function.
In the multiplayer menu, there's a display of the number of players on each team. In the corresponding menu component class, this information is stored in a TeamInfo struct. Every TeamInfo struct stores counts of the number of ready and unready players, and every frame these counters are reset to zero and the game iterates over all players incrementing the appropriate counters.
The bug was a result of a contract failure. The programmer who wrote the menu component understood that there would only be four teams available in multiplayer, so the menu component contained an array of only four TeamInfo structures. The code to count ready players looked something like
There was no assertion that pPlayer->nTeam was less than four.
The programmer responsible for creating network player objects didn't realize the restriction on team count, and was assigning players to team 9. The out-of-bounds memory access incremented a random memory location, which happened to be the next pointer of the memory header for a standard string.
Now these were both bright guys, and very capable programmers. This sort of thing happens, unfortunately, and it mostly happens in the gray spaces where one programmer's code is interacting with another's. The companies that I've worked at have neither had strict code ownership nor practiced pair programming, so someone is always interfacing with code he doesn't entirely understand.
Programmers will always make mistakes. They'll especially make mistakes when dealing with code they don't understand. But just as a crash is a sign that there's a bug in code that needs to be fixed, frequent bugs indicate that there's a bug in the development process that needs to be fixed.
The right answer, I think, is not code ownership but code responsibility. One engineer needs to be made responsible for each library or logically self-contained module in the engine. He should be the primary developer within that library and should review any other check-ins made to that library. Public interfaces exposed by a library should be subject to a more general code review, either by a chief architect or by interested peers.
But we also just need to be good at debugging. I think I did pretty well with this one, but I could have done better. I wasted time trying to figure out why input playback wasn't doing what I expected when I should have focused on the problem at hand. When it became obvious that the memory header list was becoming corrupted I should have written my list verification function immediately instead of trying to trace through what was happening in the debugger. Even if the crash had still happened when I stepped the debugger through the deletion, that wouldn't have given me any information that helped in tracing the trail back further.
Debugging is the process of tracing a continuous chain of events backwards from a final known incorrect state to the initial incorrect state that led to it. At any given stage, you should be asking yourself, What's wrong here? How did it get that way? How can I trap it when it happens?
Any opinions expressed herein are in no way representative of those of my employers.