04 April 2005

Checksums - Don't Leave the Server Without Them

UDP, performance gains, and datacenter horror stories from Sun and NetApp

Lloyd Wood, commenting on an e2e post recently, was asked why UDP has an end-to-end checksum on the packet when it doesn't do retransmissions, and whether it should be turned off. Lloyd noted that UDP "could have the checksum turned off, which proved disastrous for a number of applications, subtly corrupted filing systems which didn't have higher-level end2end checks". Lloyd is exactly right here. But why would someone turn off UDP checksums in the first place - it doesn't seem to make sense, does it?


People often turn off UDP checksums to "buy" more performance, relying instead on the CRC of the Ethernet frame. So this is not a stupid question - it's a very smart question, and a lot of smart people get fooled by the apparent simplicity of the trade. The catch is that the Ethernet CRC only protects the data on each link; it says nothing about what happens to the packet in the memory of a NIC or switch along the way. Turning off checksums for performance is also no longer necessary: intelligent NIC technologies like SiliconTCP and TOE calculate the checksum as the packet is being received.
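To make the discussion concrete, here is a minimal Python sketch of the 16-bit ones' complement sum that UDP uses, in the style of RFC 1071. It is an illustration only: the real UDP checksum also covers a pseudo-header built from the IP addresses, which is left out here, and the function name `internet_checksum` is my own.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


payload = bytes.fromhex("0001f203f4f5f6f7")  # sample bytes from RFC 1071
checksum = internet_checksum(payload)

# Receiver side: checksumming the data plus its checksum gives 0 when intact.
intact = internet_checksum(payload + checksum.to_bytes(2, "big")) == 0

corrupted = bytearray(payload)
corrupted[3] ^= 0x04  # flip a single bit somewhere along the way
detected = internet_checksum(bytes(corrupted) + checksum.to_bytes(2, "big")) != 0
```

The point is how little work this is per packet: a handful of additions, which is exactly the sort of thing a NIC can do in hardware as the bytes arrive.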


Silent corruption of this kind is a surprisingly common problem in datacenters - sometimes the culprit is a switch, sometimes a configuration error, sometimes a programming error in the application, and so forth. I most recently experienced it with an overheated Ethernet switch passing video on an internal network. Since we don't have things like SiliconTCP in commodity switches yet, check that switch if you're having problems. In the meantime, here are a few little datacenter horror stories to put in your pocket.


The Sun datacenter back in the early 1990s had an NFS cluster project called Sunbox - an array of workstation CPUs that used divide and conquer to build a massive file server. It used an Ethernet multiplexer to dynamically split the load. To buy back performance, they turned off the UDP checksum. It worked fine until they had a bad lot of Ethernet boards with substandard memories. This wasn't picked up in tests because the test units were resending the occasionally corrupted packets (UDP checksums were usually turned on, and with TCP a failed checksum would trigger a resend as well). It was also a fairly rare problem, and the test periods were too short to pick up on its nature easily.


But when UDP checksums were turned off in normal use, corrupted NFS requests went undetected and damaged the filesystem (in this case, database files), forcing rebuilds and manual repairs of database tables.


They discovered the problem just as they were about to announce and release it. Having noticed the corruption, they turned on checksums to determine whether it originated in the higher levels (the stack or above) or the lower levels, and it worked immediately.


They then examined the packets that failed the checksum, tracing back down the stack through the link layer to discover where the corruption occurred. With logic analyzers, they were able to observe that the contents going into memory from the NIC on reception differed from the contents coming out of memory and traveling across the bus to the processor.
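The failure mode in this story can be mimicked in a few lines of Python. This is a toy simulation under my own assumptions, not a reconstruction of the Sunbox hardware: `zlib.crc32` stands in for the Ethernet frame CRC, `zlib.adler32` stands in for an end-to-end checksum, and `flaky_buffer` is a hypothetical bad memory that flips one bit after the link CRC has already been verified.

```python
import zlib


def link_send(payload: bytes) -> bytes:
    """Append a CRC-32 as a stand-in for the Ethernet frame check sequence."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def link_receive(frame: bytes) -> bytes:
    """Verify and strip the per-hop CRC, as a NIC does on reception."""
    payload, fcs = frame[:-4], frame[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != fcs:
        raise ValueError("link CRC mismatch")
    return payload


def flaky_buffer(payload: bytes) -> bytes:
    """Hypothetical bad board memory: flips one bit after the CRC check."""
    damaged = bytearray(payload)
    damaged[0] ^= 0x01
    return bytes(damaged)


data = b"WRITE block 7"
end_to_end = zlib.adler32(data)  # computed by the sender over the data itself

# The frame crosses the wire cleanly, so the link CRC passes...
received = flaky_buffer(link_receive(link_send(data)))

# ...but the bytes handed to the host are no longer what was sent.
silently_corrupted = received != data
caught_end_to_end = zlib.adler32(received) != end_to_end
```

The link CRC raises no error here because the wire was never at fault. Only the check computed at one end and verified at the other sees the damage.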


I also ran into this at an Internet portal company where I was a manager. We were using NetApp file servers to mirror the daily information, and NetApp at the time encouraged staff to turn off checksums to increase performance. The DBAs noticed problems and ended up doing frequent rebuilds, but couldn't figure out why. It took me a lot of time to convince my staff to turn on the checksums because they were told "they don't have to" by NetApp. Most datacenter staff work by cookbook, and this wasn't in the cookbook. When they finally tried it, it worked. This little problem cost us a lot of time and aggravation for very little (if any) performance gain.


Higher-level checksums are worth it every time. Don't leave the server without them.
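For applications that can't count on the transport, a higher-level check can be as simple as carrying a digest with each record. The sketch below is one possible way to do it in Python, assuming SHA-256 from the standard `hashlib` module; the `seal` and `unseal` names are hypothetical, and nothing here is taken from NFS or any particular database.

```python
import hashlib

DIGEST_LEN = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def seal(record: bytes) -> bytes:
    """Append a SHA-256 digest so the reader can verify the record later."""
    return record + hashlib.sha256(record).digest()


def unseal(blob: bytes) -> bytes:
    """Return the record if its digest matches, otherwise raise ValueError."""
    if len(blob) < DIGEST_LEN:
        raise ValueError("blob too short to contain a digest")
    record, digest = blob[:-DIGEST_LEN], blob[-DIGEST_LEN:]
    if hashlib.sha256(record).digest() != digest:
        raise ValueError("record failed integrity check")
    return record


stored = seal(b"row 42: balance=100")
assert unseal(stored) == b"row 42: balance=100"

tampered = bytearray(stored)
tampered[4] ^= 0x01  # a single flipped bit anywhere in the path
try:
    unseal(bytes(tampered))
except ValueError as err:
    print("rejected:", err)
```

A check like this travels with the data all the way from writer to reader, so it doesn't matter which switch, board, or bus did the damage.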

Posted by lynne at 09:31