Occam’s Razor

A couple of days ago I discovered a bug in a popular piece of software that we use at work. It took a few back-and-forth conversations with one of the developers on the mailing list, but I was able to work past it and find a solution that put my system back on track. This little nugget came in today as a follow-up to that conversation, after another person (not a developer of this product) asked me whether I thought it could be bad memory. I assured him that we had looked at that possibility and ruled it out. Then there was this:

You lost the contents of one file didn’t you? Hits on memory from
cosmic rays cause random single bit flips at a small but measurable
rate, though ECC memory should help. These could cause something in
Linux’s page buffer to go awry and hence ending up with an empty file
when unexpected. The bug rate of normal software means that
these sorts of errors are almost completely masked, however when code
matures these sorts of errors start to become important.

Higher assurance software tends to include checksums and other simple
invariants that are checked at various places in code to make sure that
errors like this aren’t propagated too far.

That is the most ridiculous thing I have ever read, never mind that I had already ruled out any hardware problems. I’ve been writing code for a long time, and on the rare occasions when memory in a production system has gone bad, I have never chalked it up to cosmic rays.

Come to think of it, I think I’ll open a bug against myself in all of my projects marked “Cosmic rays cause unpredictable behavior”, and any time I come across something that’s unreproducible I’ll just mark it as a dupe of that one.
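
To be fair to the quoted reply, the checksum idea it mentions is easy enough to picture. Here is a minimal Python sketch of it, not anything taken from the software in question: the function names, the `.sha256` sidecar file, and the `example.dat` filename are all made up for illustration, and a real high-assurance system would likely do something more careful.

```python
import hashlib
import os
import tempfile


def write_with_checksum(path, data):
    """Write data plus a sidecar file holding its SHA-256 digest (hypothetical layout)."""
    with open(path, "wb") as f:
        f.write(data)
    with open(path + ".sha256", "w") as f:
        f.write(hashlib.sha256(data).hexdigest())


def read_verified(path):
    """Return the file contents, or raise ValueError if the digest does not match."""
    with open(path, "rb") as f:
        data = f.read()
    with open(path + ".sha256") as f:
        expected = f.read().strip()
    if hashlib.sha256(data).hexdigest() != expected:
        raise ValueError("checksum mismatch for " + path)
    return data


def demo():
    """Round-trip a file, then empty it and confirm the check notices."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.dat")
        write_with_checksum(path, b"important contents")
        intact = read_verified(path) == b"important contents"
        # Simulate the file unexpectedly ending up empty.
        open(path, "wb").close()
        try:
            read_verified(path)
            detected = False
        except ValueError:
            detected = True
        return intact, detected
```

Calling `demo()` writes a file, reads it back, then truncates it and checks that `read_verified` refuses the empty copy. Note that this only tells you the contents changed; it says nothing about whether the culprit was a bug, a bad disk, or a cosmic ray.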
