16 March
2006

Microsoft's Ultimate Throughput - Change the Compiler, Not the Processor

Herb Sutter, Non-Von Neumann dataflow architectures, and software concurrency

I like people who go out on a limb to push for some much-needed change in the computer biz. Not that I always like the idea itself - but moxie is so rare nowadays that I have to love the messenger despite the message. So here comes Herb Sutter, Microsoft Architect, pushing the need for real concurrency in software. Sequential is dead, and it's time for parallelism. Actually, it's long overdue in the software world.


In the hardware world, we've been rethinking Von Neumann architecture for many years - SiliconTCP from InterProphet, a company I co-founded, uses a non-Von Neumann dataflow architecture (state machines and functional units - not instruction code translated to Verilog because that never works) to bypass the old-style protocol stack in software, because an instruction-based general processor can never be as efficient for streaming protocols like TCP/IP as our method. Don't believe me? Check out Figures 2a-b for a graphic on how much you wait for store and forward instead of doing continuous flow processing - the loss for one packet isn't bad, but do a million and it adds up fast.
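To put rough numbers on the store-and-forward penalty, here is a minimal Python sketch. It is a toy model and not a description of how SiliconTCP works: the link rate, packet size, header size and number of processing stages are assumed values chosen only for illustration, and the function names are hypothetical.

```python
def store_and_forward_latency(packet_bytes, stages, rate_bps):
    """Seconds for one packet when every stage buffers the whole packet first."""
    return stages * (packet_bytes * 8 / rate_bps)


def continuous_flow_latency(packet_bytes, header_bytes, stages, rate_bps):
    """Seconds for one packet when each stage forwards as soon as the header is in."""
    serialization = packet_bytes * 8 / rate_bps    # paid once, end to end
    per_stage_wait = header_bytes * 8 / rate_bps   # paid at each later stage
    return serialization + (stages - 1) * per_stage_wait


def extra_wait(packets, packet_bytes, header_bytes, stages, rate_bps):
    """Total extra seconds spent waiting under store and forward, summed over packets."""
    per_packet = (store_and_forward_latency(packet_bytes, stages, rate_bps)
                  - continuous_flow_latency(packet_bytes, header_bytes, stages, rate_bps))
    return packets * per_packet


if __name__ == "__main__":
    # Assumed figures: 1500-byte packets, 40-byte header, 3 stages, 1 Gbit/s.
    one = extra_wait(1, 1500, 40, 3, 1e9)
    million = extra_wait(1_000_000, 1500, 40, 3, 1e9)
    print(f"extra wait for one packet: {one * 1e6:.2f} microseconds")
    print(f"extra wait for a million packets: {million:.2f} seconds")
```

Under these assumptions the per-packet loss is measured in microseconds, while the summed loss over a million packets is measured in seconds.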


It's all about throughput now - and throughput means dataflow in hardware. But what about user-level software applications? How can we get them the performance they need when the processor is reaching speed-of-light limits? If, on a typical processor, a signal moving at the speed of light gets from one end to the other in about one clock cycle at 7-8 GHz, then clock speed is near its ceiling, and anyone stuck in sequential processing will be outraced by Moore's Law, multiple cores and specialized architectures like SiliconTCP.
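The speed-of-light figure is easy to check with back-of-the-envelope arithmetic. The sketch below computes how far a signal can travel in one clock cycle. The `velocity_factor` parameter is an assumption added for illustration: 1.0 means light in a vacuum, and real on-chip signals are slower than that.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the metre


def distance_per_cycle_cm(clock_hz, velocity_factor=1.0):
    """Centimetres a signal covers in one clock period.

    velocity_factor is the assumed fraction of the vacuum speed of light
    at which the signal actually propagates (1.0 = vacuum).
    """
    return SPEED_OF_LIGHT_M_PER_S * velocity_factor / clock_hz * 100


if __name__ == "__main__":
    for ghz in (7, 8):
        vacuum = distance_per_cycle_cm(ghz * 1e9)
        halved = distance_per_cycle_cm(ghz * 1e9, velocity_factor=0.5)  # assumed factor
        print(f"{ghz} GHz: {vacuum:.2f} cm in vacuum, {halved:.2f} cm at half speed")
```

Even in a vacuum this comes to only a few centimetres per cycle at 7-8 GHz, so there is little room left between the clock period and the time a signal needs to cross the chip.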


Herb is a great presenter, and does know his stuff. He also looks the MS part - works out and has those grand gestures of the inspirational speaker down pat. But he has a tough role to play - convincing software developers to take some responsibility for the performance of their apps. Because, as he puts it, "the free lunch is over". He also sees processors reaching the limit on performance for today's sequential applications, and a world that will be full of non-Von Neumann machines built to get performance. The monolithic kernelized systems ranging from Linux to old-style BSD to Windows are at a serious disadvantage here - we do need blocking for exceptions, for example - and even a completely modularized kernel like 386BSD, while a lot more efficient in performance, still waits a bit on processors. But it's the apps in userland that really suffer here (even if we avoid ring crossing), and clustering custom-built apps means older software vendors won't be competitive - unless they change their tools, their techniques, and even the way they do product and market requirements planning.


Does this impact the server side? Not really - products like SiliconTCP take the weight of parasitic communications overhead off server processors, and since most requests are independent (like a three-tiered client-server transaction) but similar in action, we can use carefully structured data and shared memory resources to always fill up the processing queue in a highly concurrent format. But a client machine is a different beastie - it doesn't usually share copies, and the shared data is "unstructured and promiscuous".


Herb sees locking issues as the major bane of concurrency in applications, and I agree. Waiting on locks is a very delicate practice in the OS itself (see Volume 1: The Basic Kernel for a detailed discussion of locks and threads). And he also sees the need to delineate different levels of abstraction based on object-oriented concurrency and parallelism. So what's his approach?


This is a big problem, and his answer is small - too small to work, because it's too late to do the small idea in an Internet world. He thinks the key is changing the compiler to put in a mechanism for instantiating a lock once instead of every time, coupled with a mechanism he calls "futures" to hold the guess of what might be the result. The latter isn't much different from Sun's prefetch/precalculate approach for Niagara, with all of its benefits and drawbacks (most importantly, it didn't help enough according to Sun's own stats).
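For readers who haven't met futures: as the term is commonly used, a future is a placeholder for a result that is still being computed, which the caller holds onto and only blocks on when the value is actually needed. The sketch below shows that general idea with Python's standard `concurrent.futures` module. It is not Herb's proposed compiler mechanism, and `slow_square` and `sum_of_squares` are hypothetical names made up for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_square(n):
    """Stand-in for an expensive computation (the sleep is artificial)."""
    time.sleep(0.05)
    return n * n


def sum_of_squares(values):
    """Start every computation first, then collect the results."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # submit() returns at once with a Future; the work runs in the pool.
        futures = [pool.submit(slow_square, v) for v in values]
        # result() blocks only if that particular value is not ready yet.
        return sum(f.result() for f in futures)


if __name__ == "__main__":
    print(sum_of_squares([1, 2, 3, 4]))
```

The caller's code still reads sequentially, which is the appeal, but the developer has to decide which pieces of work are independent enough to hand off.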


But even these small changes met with stout resistance and objections. Herb is right in saying that once we go parallel in software our simple test cases don't hold - race conditions and lock contention are a real problem in the new world order. But his approach, for all of his strong words, is too little precisely because he doesn't wish to inconvenience the very developers who will live and die by throughput. So the OS will get bigger, vendors won't change their apps in userland to use multiple cores effectively, and the reasons for upgrading expensive software packages will become fewer and fewer.
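As a small illustration of why the simple test cases stop holding, here is a sketch of a shared counter updated from several threads. The `Counter` class and `count_in_parallel` function are hypothetical names for this example. The lock makes each update atomic. Without it, the read-modify-write in `increment` is not atomic and updates can be lost, and with it, every thread queues on the same lock, which is the contention cost.

```python
import threading


class Counter:
    """A shared counter whose updates are serialized by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Without the lock, two threads could both read the same old value
        # and one of the two increments would be lost.
        with self._lock:
            self.value += 1


def count_in_parallel(threads=4, increments=10_000):
    """Run `threads` workers that each increment the counter `increments` times."""
    counter = Counter()

    def work():
        for _ in range(increments):
            counter.increment()

    workers = [threading.Thread(target=work) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return counter.value


if __name__ == "__main__":
    print(count_in_parallel())
```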


So what this is really about is the ever-widening gap between hardware and software. Since the only way hardware can go faster is multiple cores and non-Von Neumann architectures, hardware designers will go in that direction. If software developers choose to stay in their sequential programmatic direction, they'll be capped. And maybe, just maybe, hardware designers will extend their offerings more and more into the realm of software. So instead of worrying about C++, maybe it's the beginning of a Verilog world. Wouldn't that be something?

Posted by lynne at 11:29