More war stories about checksum failures over the years. Craig Partridge recalls that “some part of BBN” experienced an NFS checksum issue and that it “took a while for the corruption of the filesystem to become visible…errors are infrequent enough that NIC (or switch, or whatever, …) testing doesn’t typically catch them. So bit rot is slow and subtle — and when you find it, much has been trashed (especially if one ignores early warning signs, such as large compilations occasionally failing with unrepeatable loading / compilation errors)”. Craig is absolutely right: this was exactly the case with the Sunbox project I described, as well as with the datacenter mirror example (see Checksums – Don’t Leave the Server Without Them). Too much damage, too late. As our implicit dependence on reliability increases, the value of checksums becomes very clear.
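The early-warning value Craig describes comes from checking integrity end to end, at the point where the data is ultimately used. As a minimal sketch (not the BBN or Sunbox setup; the choice of SHA-256 and the file path are my own illustrative assumptions), an application-level checksum recorded while the data is known-good lets you detect bit rot long before occasional unrepeatable failures turn into a trashed filesystem:

```python
import hashlib

def file_checksum(path, chunk_size=1 << 20):
    """Compute a SHA-256 digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path, expected_digest):
    """Return True if the file still matches the digest recorded when it was known-good."""
    return file_checksum(path) == expected_digest

# Hypothetical usage: record a digest while the data is known to be good,
# then re-verify periodically so silent corruption is caught early, before
# it propagates into backups and mirrors.
# good = file_checksum("/data/archive.tar")
# ...later...
# assert verify("/data/archive.tar", good)
```

The particular hash matters less than where the check is done: verifying where the data is consumed catches corruption introduced anywhere along the path, whether in a NIC, a switch, or a disk.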
With the early deep space probes, engineers learned the hard way the importance of always providing enough redundancy and error correction, because a single bit error might be the one that destroys the spacecraft’s ability to communicate. At least one spacecraft was destroyed by exactly such a corruption error: reliability was optimized out to gain a slightly greater data rate, and the spacecraft was lost (and this has happened more than once).
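To make the redundancy point concrete, here is a deliberately simple sketch of forward error correction, triple redundancy with majority voting. This is not the coding any actual mission used; it just shows how spending extra bits lets a receiver correct a single bit error rather than lose the data:

```python
def encode_tmr(bits):
    """Encode with triple redundancy: transmit every bit three times."""
    return [b for bit in bits for b in (bit, bit, bit)]

def decode_tmr(received):
    """Majority-vote each group of three; any single flipped bit per group is corrected."""
    decoded = []
    for i in range(0, len(received), 3):
        group = received[i:i + 3]
        decoded.append(1 if sum(group) >= 2 else 0)
    return decoded

message = [1, 0, 1, 1]
sent = encode_tmr(message)
sent[4] ^= 1                        # a single bit flipped in transit
assert decode_tmr(sent) == message  # ...and corrected at the receiver
```

Real codes are far more efficient than tripling every bit, but the trade-off is the same one the probes faced: give up some raw data rate to keep the link usable when errors occur.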
We’re reaching a point where we have to think seriously about whether an “optimization” is really worth it, since, as Craig notes, you may not notice a problem until it is too late. In this age of ubiquitous computing, with plentiful processing power, memory, and network bandwidth, we should be focused on increased reliability and integrity, but old habits from a more parsimonious age die hard.
Another very recent example of ignoring the value of checksums is the ‘fasttrack’ toll system’s problems with incorrect billing. But that’s another story…