What do you do when all normal debugging means fail? You can reproduce a bug, but the debugger just doesn’t seem to pinpoint it. Print statements have helped narrow it down, but what they show doesn’t make sense. What happens when it seems like you’ve exhausted your options? It’s time to resort to binary searching your code.
A platform bug
I had an issue where my code wasn’t working on Windows. It was working fine on other backends but consistently failing when compiled for .NET. It seemed really strange since I wasn’t actually doing anything OS-specific.
I wasn’t getting a typical exception either. It was an odd FatalExecutionEngineError that I’d not seen before. Thus the typical exception trace wasn’t available.
I ran the code in a debugger and it found the line of code where the error occurred. Inspecting the variables, I found nothing out of the ordinary. I ran the debugger again and it just kept pointing at the same line. There was nothing about this line that should cause an error.
Roll back the code
I tried adding/removing code but wasn’t able to solve the problem. I needed a different approach.
I rolled the code back to a point where I guessed it was working before. I recompiled, ran the test again, and it was working fine. All I needed to do now was find which revision introduced the problem.
The tricky thing here is that switching revisions required recompiling the entire project, since the changed file dates don’t play well with the build system. Far from a fast turnaround, I was left with a rather long cycle.
It didn’t help at this point that my machine was overheating, to the point where it shut off. I had to interrupt the debugging to open my case and clear a dust build-up off the heat sink.
I had only about 30 revisions to check, but given the long build, testing each one in turn would take far too long. A binary search is a good option for locating the suspect revision. I know between which two revisions the defect appeared, so I pick a point roughly in the middle. Depending on whether the defect shows up there, I know which of the two chunks of revisions contains the problem, and I repeat the process on that chunk.
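The search itself can be sketched in a few lines of Python. This is only an illustration under some assumptions: the revision names are made up, and `is_bad` is a hypothetical stand-in for "check out this revision, rebuild, and run the failing test". It also assumes that once the defect appears it stays in every later revision.

```python
from typing import Callable, List, Sequence


def first_bad_revision(revisions: Sequence[str],
                       is_bad: Callable[[str], bool]) -> str:
    """Return the earliest revision that shows the defect.

    Assumes revisions[0] is known good and revisions[-1] is known bad.
    """
    good, bad = 0, len(revisions) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(revisions[mid]):
            bad = mid    # defect already present: look earlier
        else:
            good = mid   # still fine: look later
    return revisions[bad]


# Hypothetical history: r100 is known good, r130 is known bad,
# which leaves 30 candidate revisions.
revisions = [f"r{i}" for i in range(100, 131)]
builds: List[str] = []


def is_bad(rev: str) -> bool:
    # Stand-in for a full rebuild and test run; here we simply pretend
    # the defect was introduced in r119.
    builds.append(rev)
    return int(rev[1:]) >= 119


culprit = first_bad_revision(revisions, is_bad)
print(culprit, "found after", len(builds), "builds")
```

With 30 candidates, each build halves the remaining range, so the search needs at most five builds instead of up to thirty.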
After a bit of time I located the revision that introduced the error. Surprisingly, the code the debugger pointed to was part of that revision. It still didn’t make any sense though.
From here I do the same thing, removing and adding back sections of code until I find the right combination that toggles the defect on and off. This requires a bit more thinking than just switching between code revisions, since I can’t just randomly comment out code.
I can’t really call it binary searching now, but the same principle applies. I comment out the largest chunks of code first, slowly refining to smaller bits. Fortunately this revision wasn’t so large, so it didn’t actually involve too many iterations.
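As a rough sketch of this largest-first narrowing, assume the revision can be split into named chunks of known size and that a hypothetical `fails` function rebuilds with only the given chunks applied and reports whether the defect appears. The chunk names and sizes are invented for the example. A combination that doesn’t compile would simply count as "not failing", so its chunk stays in.

```python
from typing import Callable, Dict, Set


def narrow_chunks(chunks: Dict[str, int],
                  fails: Callable[[Set[str]], bool]) -> Set[str]:
    """Drop chunks, largest first, while the defect still reproduces.

    chunks maps a chunk name to its size in lines.
    """
    enabled = set(chunks)
    for name in sorted(chunks, key=chunks.get, reverse=True):
        trial = enabled - {name}
        if fails(trial):
            enabled = trial  # defect doesn't need this chunk: leave it out
    return enabled


# Hypothetical revision made of four chunks of different sizes.
chunks = {"a": 120, "b": 40, "c": 15, "d": 2}


def fails(enabled: Set[str]) -> bool:
    # Stand-in for rebuild and test; here we pretend the defect only
    # appears when chunks "b" and "d" are both present.
    return {"b", "d"} <= enabled


suspects = narrow_chunks(chunks, fails)
print(sorted(suspects))
```

Trying the biggest chunks first means each successful removal discards as much code as possible, which matters when every trial costs a full rebuild.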
This is a good reason why I’m in favour of frequent small commits. They are immensely helpful when trying to locate a defect this way, or to understand why a change was made.
The line of code I eventually found was the same one the debugger had pointed to. It still didn’t make any sense. But the line was one of a pair: to fix the defect I needed to remove another line as well, which is perhaps why normal debugging was difficult. I reverted my code to the head of the branch and checked that those lines were still the problem.
It may not have been clear exactly what the defect was, but by pinpointing it I was able to find a workaround rather quickly. I also filed an issue to look into the exact cause later.
It’s not surprising that a difficult bug was revealed. By the time I resort to binary searching my code revisions, something tricky is usually happening. I once used the technique to locate a memory corruption in a large C++ codebase. The fix there was easy, but it took a long time to find.
When all else has failed, binary searching through revisions is an effective way to localize a defect.