Software bugs don’t have to be scary


Looking at a long bug list with a short deadline can certainly be scary, but a few strategies can make identifying and fixing software bugs manageable.

A second set of eyes

Sometimes a very simple software error is staring right at us, but we miss it because we have been staring back at it for so long. In one case I faced, a section of code appeared to set a value used elsewhere in the program, but that value never changed. When I finally asked someone to look at the code for me, the first question was “Isn’t that a pointer?” Then I could see what I had been missing for hours: I was changing the value of the pointer itself, not the contents of the location it pointed to. Agile teams try to reduce these problems with pair programming, and sometimes a good night’s sleep instead of an all-nighter is the remedy.

The right tools

Often the fastest way to find the source of an error is to step through the code using a debugger or in-circuit emulator. There is no faster way to catch mistakes like using && where & was intended, assigning the wrong variable, or any number of other logic errors. Sometimes improvisation is necessary. I was once debugging a flash write routine that had to be copied to RAM to execute. Fortunately, I had two unused port pins on the microcontroller, so I toggled their states at strategic points in the routine and used an oscilloscope to trace execution until the routine worked correctly.

A systematic approach

Some software errors are more difficult to diagnose than others. One particularly nasty case I faced was a communication error between two modules on a machine. It occurred at startup, but only about 2% of the time, and only on a complete machine, not on my test bench. Add in deadline pressure and high visibility, and I found myself debugging code in a vehicle cab with several managers milling around outside. My approach was a process of elimination: making small changes to rule out possible causes. When I added a few new network messages to be broadcast, I shifted the timing enough to make the error happen every time instead of intermittently. Not only did that make the issue easier to track down without constantly cycling the key on and off, it also gave me a great clue to the root cause: network message timing. From there, the issue was quickly resolved.

Taking the right approach can make debugging relatively painless and can help you answer the question “When will you have this bug fixed?” with something better than “One hour after I figure out what is causing it!”