Sunday, February 14, 2010

A bad workman blames his tools

One of the most depressing things about being a programmer is the realization that your time is not entirely spent creating new and exciting programs, but is actually spent eliminating all the problems that you yourself introduced.

This process is called debugging. And on a daily basis every programmer must face the fact that as they write code, they write bugs. And when they find that their code doesn't work, they have to go looking for the problems they created for themselves.

To deal with this problem the computer industry has built up an enormous amount of scar tissue around programs to make sure that they do work. Programmers use continuous integration, unit tests, assertions, static code analysis, memory checkers and debuggers to help prevent and find bugs. But bugs remain and must be eliminated by human reasoning.

Some programming languages, such as C, are particularly susceptible to bugs that appear and disappear seemingly at random. These are sometimes called heisenbugs because as soon as you go searching for them they vanish.

These bugs can appear in any programming language (and especially when writing multi-threaded code where small changes in timing can uncover or cover race conditions). But in C there's another problem: memory corruption.

Whatever the cause of a bug, the key steps in finding and eliminating it are:

  1. Find the smallest possible test case that tickles the bug. The aim is to find the smallest and fastest way to reproduce the bug reliably. With heisenbugs this can be hard, but even a fast way to reproduce it some percentage of the time is valuable.

  2. Automate that test case. It's best if the test case can be automated so that it can be run again and again. This also means that the test case can become part of your program's test suite once the bug is eliminated. This'll stop it coming back.

  3. Debug until you find the root cause. The root cause is vital. Unless you fully understand why the bug occurred you can't be sure that you've actually fixed it. It's very easy to get fooled with heisenbugs into thinking that you've eliminated them, when all you've done is covered them up.

  4. Fix it and verify using #2.

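For step 2, the automation can be as simple as a shell script that re-runs the reproduction case until the failure shows up (a sketch; the `repro` body is a placeholder for whatever command tickles your bug):

```shell
#!/bin/bash
# Re-run the reproduction case up to 100 times, reporting the first
# failure -- useful when a heisenbug only strikes some fraction of runs.

repro() {
    # Replace this body with the real reproduction command, e.g. ./myprog
    true
}

for i in $(seq 1 100)
do
    if ! repro
    then
        echo "bug reproduced on run $i"
        exit 1
    fi
done
echo "no failure in 100 runs"
```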

Yesterday, a post appeared on Hacker News entitled When you see a heisenbug in C, suspect your compiler’s optimizer. This is, simply put, appalling advice.

The compiler you are using is likely used by thousands or hundreds of thousands of people. Your code is likely used by you. Which is more likely to have been shaken out and stabilized?

In fact, it's a sign of a very poor or inexperienced programmer if their first thought on encountering a bug is to blame someone else. It's tempting to blame the compiler, the library, or the operating system. But the best programmers are those who control their ego and are able to face the fact that it's likely their fault.

Of course, bugs in other people's code do exist. There's no doubt that libraries are faulty, operating systems do weird things and compilers do generate odd code. But most of the time the fault is yours, the programmer's. And that applies even if the bug appears to be really weird.

Debugging is often a case of banging your head against your own code, repeating to yourself all the impossible things that can't ever happen in it, until one of those impossible things turns out to be possible and you've found the bug.

The linked article contains an example of exactly what not to conclude:

“OK, set your optimizer to -O0,”, I told Jay, “and test. If it fails to segfault, you have an optimizer bug. Walk the optimization level upwards until the bug reproduces, then back off one.”

All you know from changing optimization levels is that optimization changes whether the bug appears or not. That doesn't tell you the optimizer is wrong. You haven't found the root cause of your bug.

Since optimizers perform all sorts of code rearrangement and speed-ups, changing optimization levels is very likely to change the presence or absence of a heisenbug. That doesn't make it the optimizer's fault; it's still almost certainly yours.

Here's a concrete example of a simple C program that contains a bug that appears and disappears when optimization level is changed, and exhibits other odd behavior. First, here's the program:

#include <unistd.h>   /* for getpid() */

int a()
{
    int ar[16];

    ar[20] = (getpid() % 19 == 0);

    return 0;
}

int main( int argc, char * argv[] )
{
    int rc[16];

    rc[0] = 0;

    a();

    return rc[0];
}

Build this with gcc under Mac OS X using the following simple Makefile (I saved the program in a file called odd.c):

CC=gcc
CFLAGS=

odd: odd.o

And here's a simple test script that runs it 21 times and prints the return code each time:

#!/bin/bash

for i in {0..20}
do
./odd ; echo -n "$? "
done
echo

If you run that test script you'd expect a string of zeroes, because rc[0] is never set to anything other than zero in the program. Yet here's sample output:

$ ./test
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

If you are an experienced C programmer you'll see how I made that 1 appear (and why it appears at different places), but let's try to debug with a quick printf:

[...]
rc[0] = 0;

printf( "[%d]", rc[0] );

a();
[...]

Now when you run the test program the bug is gone:

$ ./test
[0]0 [0]0 [0]0 [0]0 [0]0 [0]0 [0]0 [0]0 [0]0 [0]0 [0]0
[0]0 [0]0 [0]0 [0]0 [0]0 [0]0 [0]0 [0]0 [0]0 [0]0

Weird, so you move the printf:

[...]
rc[0] = 0;

a();

printf( "[%d]", rc[0] );
[...]

and get the same odd result of a disappearing bug. And the same thing happens if you turn the optimizer on even without the printfs (this is the opposite of the situation in the linked article):

$ make CFLAGS=-O3
gcc -O3 -c -o odd.o odd.c
gcc odd.o -o odd
$ ./test
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

This all came about because the function a() allocates a 16 integer array called ar and then promptly writes either 1 or 0 past the end of it, depending on whether the PID of the process is divisible by 19. It ends up writing on top of rc[0] because of the arrangement of the stack.

Adding printfs or changing optimization level changes the layout of the code and causes the bad write to not hit rc[0]. But beware! The bug hasn't gone, it's just writing on some other bit of memory.

Because C programs are susceptible to this sort of error it's vital that good tools are used to check for problems. For example, the static code checker splint and the memory analyzer valgrind help eliminate tons of nasty C bugs. And you should build your software at the maximum warning level (I prefer warnings-as-errors) and eliminate every warning.

Only once you've done all that should you start to suspect someone else's code. And even when you do, you need to follow the same steps to reproduce the bug and get to the root cause. Most of the time, unfortunately, bugs are your fault.


10 Comments:

Blogger Eric said...

On the other hand, here's the concrete example this advice was directly derived from (let's hope this survives your blog's formatter):

#define isgps_parityok(w) (isgps_parity(w) == ((w) & 0x3f))

#if 0
/*
* ESR found a doozy of a bug...
*
* Defining the above as a function triggers an optimizer bug in gcc 3.4.2.
* The symptom is that parity computation is screwed up and the decoder
* never achieves sync lock. Something steps on the argument to
* isgpsparity(); the lossage appears to be related to the compiler's
* attempt to fold the isgps_parity() call into isgps_parityok() in some
* tail-recursion-like manner. This happens under -O2, but not -O1, on
* both i386 and amd64. Disabling all of the individual -O2 suboptions
* does *not* fix it.
*
* And the fun doesn't stop there! It turns out that even with this fix, bare
* -O2 generates bad code. It takes "-O2 -fschedule-insns" to generate good
* code under 3.4.[23]...which is weird because -O2 is supposed to *imply*
* -fschedule-insns.
*
* gcc 4.0 does not manifest these bugs.
*/
static bool isgps_parityok(isgps30bits_t w)
{
return (isgpsparity(w) == (w & 0x3f));
}
#endif

6:51 PM  
Blogger John Graham-Cumming said...

Very interesting. What was the difference in the generated assembly code?

7:20 PM  
Blogger tjp said...

If only it was that easy w/ front-end development :)

8:03 PM  
Blogger Jay Freeman (saurik) said...

It is also the sign of an inexperienced programmer to presume that the code you rely on and that thousands of other people have supposedly vetted is not the cause of the problem at hand. Frankly, this is a more serious handicap than always blaming the compiler. What you eventually get better at is determining where the bug really is, and to have any hope of doing that you need to debug the system globally. The goal should be: trust no one, not even yourself.

8:38 PM  
Blogger tz said...

That is the key - check the assembly, especially around the bug. You CAN turn debugging on and set breakpoints.

Sometimes in embedded systems or multithreaded systems, things appear because something wasn't declared volatile. That is much more common in the embedded world (MIPS, Atmel, used to be ARM). I would not suspect an optimizer bug on x86, but have found a few bits of overaggressive optimization in some of the early versions of gcc for a new architecture.

It might be a bug in the optimizer, so reduce the case and check the assembly including in-situ with the debugger backing up to the root cause. Even many embedded chips have data watches. You will either find the cause or be able to create a minimal test case.

8:44 PM  
Blogger Regehr said...

John - I just wrote up a blog post on this topic that may be interesting to you.

8:50 PM  
Blogger David Cotton said...

An interesting post, thanks.

I would add a vital fifth stage: Once you know the root cause, look for other areas in the code where the same thing might be happening.

It is surprising how often you find the same error has occurred in several different places (and not always due to copy-n-paste). Often a programmer has not understood a concept fully and repeatedly made the same mistake. Nothing is more annoying than spending days tracking down a bug when someone saw something similar months before. A little time and effort can save much heartache later on.

10:26 PM  
Blogger Brian Bulkowski said...

I, also, have been thinking about that "blame the optimizer" bug. I'm a diehard C programmer who has created multiple large projects in C.

99% of the time, a bug such as referenced by the original post is a timing hole or an uninitialized variable. Timing holes in multithreaded, multicore, and distributed programs can be generated in any language.

The comments of the original post claimed the programmer inserted a few NOOPs in the code (or similar) and moved on to the next bug. The fact that timing solved the bug - not program flow - clears the optimizer of responsibility.

When the bug is an uninitialized variable or a memory overwrite, the C programmer's tool of choice should be 'valgrind'. This wonder of a program will catch every kind of bad reference a C programmer can write.

Problematically, 'valgrind' induces a run pattern that is single-core, and about 10x slower than running the code stand-alone. Still, 'valgrind' is a stunning tool for C programmers working in Linux ('valgrind' is broken in BSD when I last checked, there's a patch set that might work). A similar Windows tool is the for-pay program 'purify'. I can't speak for OS X.

My first steps when approaching a bug of any sophistication are to reduce the reproduction case, examine the code, and pull out my swiss army knife of tools. In any language, debugging is a professional skill that requires honing.

Regarding optimizer bugs, I have seen literally three in 20 years - and none in my code specifically. There was a good one related to handling of (double) arithmetic on Opteron processors in an early GCC 4 release (ran fine on Xeons). I remember a couple more from the GCC 2 days while doing embedded work.

12:05 AM  
Blogger eay said...

I've identified quite a few compiler bugs over the years. The most useful initial technique is to try another compiler on the same platform. When one compiler fails and the other works there is a chance it could be a compiler bug. At this point inserting printf()s will often help track down the error because the call will insert a barrier in the optimization of the function (causing it to start working again). Doing this will help determine where to look in the generated assembler.

1:59 AM  
Blogger David R. MacIver said...

Of course compiler bugs shouldn't be your first explanation, but they shouldn't be your last either.

The thing is, tens or hundreds of thousands of people testing it is actually a rather small number.

Or rather it's a small number compared to the number of things that can go wrong.

It's the usual problem where testing every line of code in your project (in this case, the compiler) isn't actually enough. You need to test every possible unique path through the program. And there are a hell of a lot of them.

This means that it's actually completely unsurprising when code in your project takes the compiler down some never-before-seen route, and code like that has a way higher chance of hitting a bug than the amount of testing the compiler has seen would suggest.

Fortunately it doesn't matter all that much from the point of view of debugging! You still need to figure out and fix the bug either way, and the best way to do that is usually to come up with a minimal example that demonstrates it. At that point it's much more obvious what the cause of the error is and much easier to figure out whether it's a compiler bug or user error.

5:08 AM  
