- Entries : Category [ Protocols & Networking ]
03 April
2004
Interplanetary TCP
How about a tech job where you "really get away from it all"?
Google put out a lunar job listing, for the person who really needs to get away from people.
So I went and asked Vint Cerf "Perhaps this is the first real use of interplanetary TCP?". He laughed. I think you will too.
OK, I like Google. Always have, I suppose, because it's minimalist. And I like the logo - it's kind of simple and childish, but it's very Stanford. Yes, I know, I went to Berkeley, but my dad and brothers went to Stanford, so even if there's a rivalry, it's a friendly one. Besides, usually Berkeley gets the axe to grind - I'll take a bear over a tree any day.
Posted by lynne : "Interplanetary TCP" at 11:22
20 April
2004
How Fast Can You Go?
Melting down the Hardware but where?
I've been following the CalTech and CERN groups responsible for achieving what they claim is the "latest Land Speed Record" of 5.4 Gbps and a claimed throughput of 6.25 Gbps over an average period of 10 minutes, according to the announcement to the Internet 2 newslist on February 24th.
Of course, what does this mean? They claim that "best achieved throughput with Linux is ~5.8Gbps in point to point and 7.2Gbps in single to many configuration". They claim they're melting down the "hardware" at 6.6 Gbps. Is this true?
Here's what Mr. Cluster himself, Jim Gray, says (thank you): "we are running from Pasadena to Cern at 6.6 Gbps (800MBps) and not saturating the cpu (we are melting the NIC)." So he does confirm that it's not the processor.
Hmm, not processor but NIC? How can you "melt down" something as simple as a physical link (layer 1), or a CRC (layer 2) or even the IP processing (layer 3)? They're all pretty straightforward. You only run into problems when you hit the stack for TCP, and that's done for TOEs in the processor.
Well, sounds like the limit must be the bus - getting it in and out in the first place. But why are they being so obscure about it? Only AMD and Intel can deal with the real issue - putting it on the processor itself. But then, maybe you'll get back to that fundamental problem of the stack latency processing again. And wouldn't that be embarrassing if all these new approaches still got you caught up in stack?
Kind of makes me wonder - don't you?
16 April
2004
FastTCP and SSC - A Short Meditation
Physics Corner
While we're all oohing and ahhing over CalTech's FastTCP bulk transfers and record busting using their new TCP congestion control (see the interesting, and finally published, paper by Jin/Wei/Low), contrast this with friendly rival Stanford's high-speed TCP protocol, which changes the fairness (I find it interesting, and it provides some new ideas). Is either likely to impact anyone's use of the Internet in the next decade, any more than studying cold fusion?
I'm struck by how all this "record busting" may be a mere sideshow in the scope of real Internet usage, especially given Microsoft Research's own Jim Gray's economic arguments against bulk transfers at Stanford a few months back.
Jim said that it is cheaper to send a disk drive via FedEx overnight than any of these contests could provide of benefit to ordinary users. Could the CalTech and Stanford work be too early given that hard reality? I leave that to CalTech and Stanford to battle out which is better a decade down the line. But what about what we can study now?
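To put rough numbers on that argument, here is a back-of-the-envelope sketch in Python. The 250 GB disk, the 24 hour door-to-door transit time and the 2 Mbps comparison link are my own illustrative assumptions, not figures from Jim's talk.

```python
# Back-of-the-envelope "ship the disk" bandwidth.
# ASSUMPTIONS (illustrative only): disk capacity, transit time and link
# speed are made-up round numbers, not measurements.

def effective_gbps(payload_gigabytes: float, transit_hours: float) -> float:
    """Average throughput, in Gbps, of physically shipping a disk."""
    bits = payload_gigabytes * 1e9 * 8
    seconds = transit_hours * 3600
    return bits / seconds / 1e9


def transfer_hours(payload_gigabytes: float, link_gbps: float) -> float:
    """Hours needed to push the same payload over a link at full rate."""
    bits = payload_gigabytes * 1e9 * 8
    return bits / (link_gbps * 1e9) / 3600


if __name__ == "__main__":
    print("shipped disk, Gbps:", effective_gbps(250, 24))
    print("2 Mbps link, hours:", transfer_hours(250, 0.002))
```

The point of the comparison is the edge link: a record-setting core path does not help a user whose own line would need days to move what a courier moves overnight.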
Maybe dealing with the long latency network issue that Beck et al. find makes storage jitter intractable in the first place is the real challenge of the decade.
Recently a few database experts were suggesting that the end-to-end principle might be applied to databases. Beck, Clark, Jacobson, ... don't address this. The question is "Are database commits end-to-end - do they satisfy the end-to-end principle?", such as that described in the simplest case (akin to a chaotic strange attractor in physics).
Another thing that came up was "When do latency and jitter combine in a chaotic way such that reliability is injured in database transactions?" Doyle at CalTech speaks of fragility vs complexity, and uses a combination of control theory, dynamical systems, algebraic geometry and operator theory to connect problem fragility to computational complexity, such that "dual complexity implies primal fragility", in an NP vs coNP way. It could be that "robust yet fragile" (RYF) is effective in defining what's necessary to prove a viable global storage system. EtherSAN approaches the problem by idealizing the simplest end-to-end mechanism, TCP, with fundamental remedies - not increased complexity. RYF would indicate that this is a radical improvement, because it removes primal fragility.
All this seems very similar to the old fusion sustained power burst we had in physics a decade ago. Kept everyone busy until the SSC debacle killed everything in the field. Plasma research is only now beginning to recover.
Let's go back to fundamentals with Clark et al. on end-to-end, and simply consider Beck's well-done arguments for small transactions per storage, cleaving to those goals only and not creating new ones. Reexamining definitions, and understanding them better, à la Bohr and mass, but not changing them.
20 April
2004
Speed Stunts
SLAC's PR engine results in reactions
Of course, never assume what the PR office of a university releases makes any real sense, as this SLAC press release demonstrates.
Looks like a commonplace database search trick to throttle flow control in a faster than exp backoff by probing for the likely end-to-end flow rate at any time. The question is, is this a good enough "good enough" strategy?
Jim Gray, once again, was willing to provide me a bit of perspective on this.
Jim told me "That stunt does not allow packets to get lost. There is some real engineering to make transfers at that speed actually work. But that is proceeding in parallel with the stunts." That makes me feel more confident of what I was reading. Jim's read is sensible and balanced, unlike the PR guys in the licensing office of Stanford.
I took a completely different approach with ballistic protocol processing, optimizing at key portions of the network the "best" transfer rate at that RT instant of time - it's a structural approach, really. I was uncomfortable setting an arbitrary good enough limit given the ever-changing nature of the Internet at any point in time. I found what appeared to be "good enough" was hard to prove good enough.
But of course, I trained in plasma physics, and every attempt in that area to bell the beast by setting arbitrary limits on containment has proven unsuccessful. So 40 years of research there has still left us with "good enough isn't".
So who do you think has the good enough solution? CalTech? SLAC as written up in this breathy news item? Or are they running after rainbows?
Posted by lynne : "Speed Stunts" at 22:17
30 April
2004
Oh where Oh where did my protocol engine go?
Why FEPs are great theory, but not great practice
In a walk down memory lane, Craig Partridge and Alex Cannara discussed Craig's mention of an XCP meeting and Greg Chesson, Alex saying "But, we still have suboptimal network design, insofar as we depend on TCP from the '80s and a glacial IETF process -- all this while now having complete web servers on a chip inside an RJ45 jack! So maybe his ideas for SiProts were something to consider, even if they weren't right on target?"
For those not in the know, Greg Chesson stepped on a lot of "TOEs" (hee hee) first in the early 1990s by filing a lot of patents on protocol engines (PEI - backed by HP at the time). I have a slide from a presentation that I did for Intel back in 1997 explaining why he failed - simply put, preprocessing likely conditions based on heuristics always failed in the general case, with the preprocessor commonly falling behind the processor even though it was put there to speed up the processing - so the software stack on average was usually faster.
This same process in FEPs has been repeatedly repatented in network processors - I reviewed several - but they never got the methods that allow for completion of the processing without falling behind (esp. on checksum, but there are also other conditions). I always thought Greg could sue a number of network processor companies for infringement, but since they all fail in the same way, who the hell cares.
Greg made his money in SGI, by the way, and look how that company eventually turned out - lots of "throw code over the fence" to Linux, which undermined their own sales of systems. Very self-destructive company.
08 June
2004
Hey, why don't we just turn off that pesky congestion control...
Dumb CS grad student tricks
Alex Cannara loves to push those "why don't we just turn off that pesky congestion control" papers my way. I think he does it just to annoy me. Which is correct, because I can't imagine ever getting such a paper approved. But, as Britney likes to say, "Oops, they did it again...".
It seems like every other CS grad student thinks he can get away with "disabling of TCP's congestion control" and suddenly he's solved the problem of congestion. Or, to put it in medical terms - it isn't the disease, it's the treatment. Everything is wonderful if you just stop treating the condition - even if the patient dies? Very much like a physics student thinking he's gotten around energy conservation, when he doesn't get what total energy of a system means, and wow, he's invented a perpetual motion machine.
So what's the magical fix after congestion control is turned off? Buffers! What I would give for a dime for each of these assumptions. I've just gotten a patent grant on a new memory mechanism for Internet semiconductors, and after reviewing all the other patents in the area, I can guarantee it isn't just making a bigger buffer. And so would the gentlemen from companies like Extreme and Juniper. I suppose I've got to finish up my article on buffers for Byte and get it out pronto.
Buffers make the problem worse - they increase jitter. You have to reduce the reqs for buffering by better scheduling and reduce packet sizes, not increase them (in other words, no jumbo packets - sorry Microsoft).
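A quick sketch of the arithmetic behind that claim, in Python. The link rate, packet sizes and buffer size are my own illustrative assumptions; the only thing the code does is convert bytes sitting in front of a link into time.

```python
# How much delay (and hence potential jitter) a buffer or a big packet adds.
# ASSUMPTIONS (illustrative only): a 1.5 Mbps edge link, 1500 byte and
# 9000 byte packets, and a 64 KB buffer.

def serialization_delay_s(packet_bytes: int, link_bps: float) -> float:
    """Time one packet occupies the link while it is being sent."""
    return packet_bytes * 8 / link_bps


def max_queue_delay_s(buffer_bytes: int, link_bps: float) -> float:
    """Worst-case wait for a packet that arrives behind a full buffer."""
    return buffer_bytes * 8 / link_bps


if __name__ == "__main__":
    link = 1.5e6
    print("1500 byte packet:", serialization_delay_s(1500, link), "s")
    print("9000 byte packet:", serialization_delay_s(9000, link), "s")
    print("full 64 KB buffer:", max_queue_delay_s(64 * 1024, link), "s")
```

Under these assumptions a packet's delay swings between zero (empty buffer) and the worst-case figure (full buffer), so a bigger buffer or a bigger packet widens the swing rather than narrowing it.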
All I see is another "force it through" way of doing things, with no appreciation of subtlety. As a woman in tech, I can tell a lot of these guys probably can't get a date just from the way they put together their proposals. After all, one proposition is very much like another.
17 June
2004
Procket Gets Unplugged
And Cisco gets back their own
Well, it's official - Procket is unplugged and sold to Cisco. At $89M, where they invested $300M, and Cisco was an early investor, I'd say they got a bargain. But will they use the technology?
According to one Cisco insider I spoke with, the technology isn't the big thing. "I think we regard it as a bargain: purchase 50 high end engineers, fluent in router design, ASIC design and layout, board design, SW, etc. for 80 million. Not a bad deal". As to the tech, he says simply "We are fragmented enough as it is". So they'll find a use for it somewhere but it isn't urgent.
One of the things that Cisco is moving towards is more spin-in investments, where a Tiger team is set aside as a startup, puts together a product, and then is reabsorbed back into Cisco. Makes sense for a company as big and bureaucratic as Cisco now is - and is a very low-risk way for employees to try something new.
IBM faced a similar crisis, and established the Boca Raton group who created the IBM PC. We'll see if Cisco can achieve something as industry-changing using this technique, or if their vision is too parochial to provide anything more than busywork for their M&A staff.
15 September
2004
TCP, Hold the Congestion Control?
Packet rate, congestion control, network neutrality, jitter and choices
In an off-list discussion in the protocols interest groups, I got involved in a rather deep discussion of packet rate, congestion control, network neutrality, jitter and choices in Internet design, which is actually quite interesting to share.
A little background here - one person asked if it was true (it is) that the cwnd (congestion window) internal stack variable doesn't have an immediate impact on the network, because TCP updates its actual rate only once per RTT in the congestion avoidance phase, so the cwnd += SMSS*SMSS/cwnd update with each ACK is only an internal calculation. You got that?
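For anyone who didn't get that, a minimal sketch of the per-ACK update in Python may help. It assumes a 1460 byte SMSS, one ACK per full segment in flight (no delayed ACKs) and no loss; the function name is mine, not from any stack.

```python
# Congestion avoidance: cwnd += SMSS*SMSS/cwnd on every ACK.
# ASSUMPTIONS (illustrative only): SMSS of 1460 bytes, one ACK per segment,
# no loss, and a window that is always full.

SMSS = 1460  # sender maximum segment size in bytes


def congestion_avoidance_rtt(cwnd: float, smss: int = SMSS) -> float:
    """Apply the per-ACK update for one RTT's worth of ACKs and return cwnd."""
    acks_this_rtt = int(cwnd // smss)  # one ACK for each segment in flight
    for _ in range(acks_this_rtt):
        cwnd += smss * smss / cwnd
    return cwnd


if __name__ == "__main__":
    cwnd = 10.0 * SMSS
    for rtt in range(1, 4):
        new_cwnd = congestion_avoidance_rtt(cwnd)
        print("RTT", rtt, "growth in segments:", (new_cwnd - cwnd) / SMSS)
        cwnd = new_cwnd
```

Each ACK nudges cwnd by a fraction of a segment, and a full window of ACKs adds up to slightly less than one SMSS, which is where the "one segment per RTT" description comes from.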
Which went on to the question posed to me - "While I now believe that it would actually be 'legal' according to the spec. to implement a TCP sender like this (no one seems to say that you MUST saturate your window at all times)..."
Wait partner. Going back into the Internet Wayback Machine and chatting with some of the earlier worker bees, it turned out it actually started out this way, and congestion backoff fell right out of this.
When the Internet had 56 kbit/sec lines, we coped, but the core was too slow. But then it moved to T1, which was good, but still not great. At this point, everyone began to realize the core was just too slow for the senders. If you had everyone sending the max, it blew out the routers. The simplest version of TCP didn't have much in the way of policy then - just your initial send - but that's where slow start / slow open began. This all should be in the history sections of some of the TCP books, but you can also ask some of the people who did it, since most of them are still alive and kicking. Some folks (Microsoft in particular) don't like slow start however - they keep the session always open instead, which has its own problems.
"I am sure that typically, a TCP sender updates not only an internal variable but the actual rate that it uses with every ACK, no matter what state it is in. It does not wait a RTT - the notion that it increases the rate by one segment per RTT is just the outcome of the per-ACK rate updates, seen over a long time interval."
Ahh, the inevitable "rate" issue. It seems reasonable, especially to a physicist, to consider rate in this context. But as a mathematician would say, is it continuous and deterministic?
The fact of the matter is that rate calculations are fraught with peril. The rate is often too erratic because you have stochastic as well as statistical error. Earlier Internet calculations relied on the variance of RTT stats to get the greatest thruput - not the greatest rate, but the max stable transmission. If you go beyond it, it becomes more unstable (stochastic).
So unlike a nice physics situation, such as the speed of a particle in a particular direction, rate in the Internet sense is a statistic which is the compounded randomness of congestion in switches, arrival time variance, ... the list goes on. Error analysis is a fundamental issue, but it is typically not studied by CS majors or engineers (they made us do it in physics). If it were, I'm sure that the real-world conditions of the Internet would make anyone crazy.
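As one concrete instance of leaning on RTT variance, here is a minimal Python sketch of the smoothed RTT and variance estimator used for the TCP retransmission timer, as written down in RFC 2988 (constants 1/8, 1/4 and 4). The class name and the sample values are my own, and I leave out the clock granularity term for brevity.

```python
# Smoothed RTT / RTT variance estimator for the retransmission timeout.
# ASSUMPTIONS (illustrative only): class and method names are hypothetical,
# the clock granularity term is omitted, and min_rto is configurable so the
# effect of variance is visible with small sample values.

class RttEstimator:
    ALPHA = 1 / 8  # gain for the smoothed RTT
    BETA = 1 / 4   # gain for the RTT variance
    K = 4          # variance multiplier in the timeout

    def __init__(self, min_rto: float = 1.0) -> None:
        self.srtt = None
        self.rttvar = None
        self.min_rto = min_rto

    def update(self, sample: float) -> float:
        """Fold in one RTT measurement (seconds) and return the new RTO."""
        if self.srtt is None:
            # First measurement seeds both estimates.
            self.srtt = sample
            self.rttvar = sample / 2
        else:
            # Variance is updated first, using the old smoothed RTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - sample)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * sample
        return max(self.min_rto, self.srtt + self.K * self.rttvar)


if __name__ == "__main__":
    est = RttEstimator(min_rto=0.0)
    for rtt in (0.080, 0.080, 0.160):
        print("sample", rtt, "-> RTO", est.update(rtt))
```

A single jittery sample moves the timeout mostly through the variance term, not the mean, which is the sense in which the sender aims for stability instead of chasing a rate.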
Worse yet, companies like Microsoft encourage the idea of very fat packets for their own purposes and don't worry about how it might affect overall network performance. Other experimental mechanisms distilled down to elementals simply grab whatever bandwidth is available in a forceful way with no fairness implied, like a bunch of thugs shouting "Make way" through the crowd of rabble and trampling over everyone. Again, no overall network performance - or network neutrality - is a consideration, and that is in opposition to the establishment of Internet access itself.
Let's talk jitter - not rate. This implies small packets and handling transparent congestion recovery with TCP without affecting the semantics. In this more subtle approach, we need to plan the staging of session data across the Internet without inducing any more network level jitter (or randomness). We've got to think "stable" - not play with statistics in the hopes that eventually we'll get a clear channel. It's at the end you get stable bandwidth, but that's not what you start with. Begin with "eliminate the source of the problem" and the compound equation is bettered.
When TCP is used in a highly lossy LAN wireless network, for example, it can do a remarkably good job at coping with this impossible situation. But if we extend the lossy links (like a link or network layer technology) in subsequent hops, ultimately nothing gets through. By reducing the bit error rate on a link, the way TCP performs improves remarkably (this shouldn't be a surprise).
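The compounding can be sketched in a few lines of Python. This assumes independent bit errors and identical hops, which real links rarely give you, and the bit error rates and packet size are illustrative numbers of mine.

```python
# From bit error rate to packet loss, and from one lossy hop to several.
# ASSUMPTIONS (illustrative only): independent bit errors, identical hops,
# no link-level retransmission, made-up BER values.

def packet_loss_from_ber(ber: float, packet_bytes: int) -> float:
    """Probability that at least one bit of a packet is corrupted."""
    return 1 - (1 - ber) ** (packet_bytes * 8)


def end_to_end_delivery(per_hop_loss: float, hops: int) -> float:
    """Probability that a packet survives every hop in sequence."""
    return (1 - per_hop_loss) ** hops


if __name__ == "__main__":
    for ber in (1e-4, 1e-5, 1e-6):
        loss = packet_loss_from_ber(ber, 1500)
        print("BER", ber, "per-hop loss", loss,
              "delivered over 5 hops", end_to_end_delivery(loss, 5))
```

Run it and the per-hop loss falls roughly in step with the bit error rate, while stacking hops multiplies the damage, so cleaning up one bad link pays off across the whole path.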
Just consider that if we can manage the bit error rate at the network level, across network hops, using network measures that are adaptive, we can reduce the problem so that the congestion control / flow control / backoff algorithms of the TCP connection see an idealized network. In that network, the effect of the changes imparted by the transport layer is due to what the network can actually handle over a larger time granularity, and the smaller elements of the chaos/randomness average out anyway. Thus we scale the randomness, so to speak, to something more appropriate for the RTT that you see. For example, with an 80 millisec RTT, the effects of congestion / flow control should now reflect a low-level network with an expectation that the characteristics will not change for a minimum of 2 sample times (information theoretic), and the low-level transport guarantees that this expectation will hold, regardless.
The problem with optimizing rate is that, like Chaos Theory, we don't necessarily understand the impact of the effect. Anything can destabilize it and make it more chaotic. We can play with random against random in the hopes of cancelling it out, or we can minimize the effects of the noise. A first order approach to doing this was congestion backoff and determining the bounds. It was a good idea for the time, but that time is past - it is too limited.
The limit of this approach is that you can only determine this at the endpoints, so you're stuck with an undetermined situation in the middle, as we noticed in the extreme case of a lossy wireless network. If you have a connection with an RTT of 80 millisec, the only time you can do something is every 80 millisec - there may be many sources of chaos in between, and you can still lose out.
The only realistic approach is to divide that 80 millisec interval into ten 8 millisec regions and reconcile at each point using the exact same TCP algorithm - in other words, simply break the problem down with divide and conquer. The limit was to complete full transport processing, compliant with all interoperability goals, in a timely way, and also comply with the end-to-end principle.
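A toy Python model of the divide and conquer arithmetic follows. It is not the InterProphet mechanism, just a sketch under the crude assumptions that the path splits into equal segments, that repairing one loss costs one round trip of whichever control loop notices it, and that the loss rate is the same either way.

```python
# Toy model: how long a loss takes to repair when the control loop spans
# the whole path versus one segment of it.
# ASSUMPTIONS (illustrative only): equal segments, one loop RTT per repair,
# and a made-up 1% packet loss rate.

def loop_rtt_ms(path_rtt_ms: float, segments: int) -> float:
    """Round-trip time of each control loop when the path is split evenly."""
    return path_rtt_ms / segments


def expected_repair_delay_ms(path_rtt_ms: float, segments: int, packet_loss: float) -> float:
    """Average extra delay per packet if one loss costs one loop RTT."""
    return packet_loss * loop_rtt_ms(path_rtt_ms, segments)


if __name__ == "__main__":
    for segments in (1, 10):
        print(segments, "loop(s):",
              loop_rtt_ms(80, segments), "ms per reaction,",
              expected_repair_delay_ms(80, segments, 0.01), "ms average repair delay")
```

In this model the gain is simply the number of segments, because each loop reacts that many times more often than the endpoints could.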
This has been done in the lab - we made a 10-fold improvement in the situation and we did it in 1998 at InterProphet! And we didn't rely on statistics in an increasingly complex Internet structure, nor have to prove a new algorithm, since it's the same algorithm. We just constrained in more places.
Actually, the smaller the segments, the better it works, because it adapts for the segments in a realistic way. Also, most of the effect occurs as traffic goes off the network, going from high to lower bandwidth environments at the edge.
Remember, congestion control and backoff are biased historically toward the core. We've got plenty of capacity at the core now. It's the edge that is the problem, and will remain lagging for the next 20 years. That is where the work should concentrate, because that's where the economics dominate. And hence that's where the money is.
Where everyone goes wrong is that they think the point is to eliminate the responsibility for the transport layer and thus affect semantics, because it's larger than 2x RTT level. We see looking at the stability is an enhancement that is bounded by the visibility of the mechanism.
It's an argument between planned use and stochastic use. Some opportunists see the success of this algorithm in dealing with a completely different problem at the core in a different time with a different set of conditions and think they can use butterfly's wings with more powerful tools in a new age of Internet use with a completely different set of conditions and problems emerging. In other words, the statistics are taken far beyond what is meaningful or possible. It's a really difficult problem to discuss, somewhat like dealing with a "chemist" who believes he can transmute metals with the right kind of lodestone. The alchemist mistakes the real success of chemistry in certain reactions with the assumption that all reactions are the same if you find the right catalyst. He doesn't do the hard work of really understanding how things work, and the consequent limits. Too many people don't study history and philosophy, especially in science, but that's where the disconnect comes.
We've got to understand and remove the randomness that clouds the issue - not just "control" it. Perhaps working in plasma physics, I have a pretty good feel for the limits of statistics in controlling randomness, as the history of plasma research demonstrates.
I had to consult with others on this matter today, as it is a deep and thoughtful one, and it has led to a long set of discussions. I hope their insight helps others in this matter as well.
07 October
2004
From the Mailbag - Buffer, Buffer Where is the Buffer?
Security and buffering
In my current article Buffer, Buffer, Where is the Buffer? in Byte, Jim S. sent me the following:
Hi Lynne,
Nice article in Byte. It reminds me of the old days
when you could read a good technical piece in the print Byte.
Kind of a rare phenomenon today.
But do you really mean to say that *all* security
problems are buffer problems?
Thank you Jim for your kind words. Could you please tell the editor of Byte as well? That way, more articles like this come the reader's way. :-)
No, obviously security isn't just buffer overflows. But these little bandaids are everywhere, and cause an amazing amount of problems for something so trivial.
For example, on Cnet today another buffer overrun afflicting Windows was announced. "Secunia issued an advisory saying a buffer overrun flaw has been found in Office 2000, and potentially also in Office XP, that could allow hackers to take over a user's system. The company rated the flaw as 'highly critical.'" Alas, these bulletins are all too common.
I used the essay to illustrate that a one size fits all solution like a buffer can have larger implications than my "engineer" in the introduction realized, and that his solution may not be a solution at all. There's a lot of sloppy thinking nowadays, and that doesn't help in a more competitive global economy. I'd like to see fewer unemployed obsolete engineers and scientists, and more innovation and critical thinking. So I write these essays. I hope it helps. And I hope you continue to enjoy them.
14 October
2004
A Tisket, a Tasket, I've Lost My TCP Packet
Network path analysis on a lossy network
A gentleman today wondered if his expensive leased fibre line was causing packet loss, even though he compared it with an ADSL line from the server to the host. As Dennis Rockwell of BBN pointed out "What you have discovered is that your 2Mbps link is not the bottleneck; that lies elsewhere in your network path. The extra bandwidth of the fiber link cannot help this application".
Dennis is correct. But how do you know where to look to fix the problem? Here's a little story from a manager of international datacenters in Japan and the US to illustrate how complicated the issue can become...
"There was a weak laser in an inexpensive optical ethernet LAN connection used to convey a WAN from one floor of a datacenter in Ariake to the other, resulting in a significant increase in bit error rate. While the laser was within spec for the product, the product was used on a fibre at the edge of the distance of the link - but still also within spec.
The simplest thing to do was to replace it with an optical ethernet WAN connection more suited to the use. Unfortunately, the datacenter insisted that you could only use this inadequate connection to go between floors. The rule could not be changed to repair this situation. Everyone along the path acknowledged the problem, as they all were blamed, but still the rule 'could not be changed'.
This problem impacted Japanese consumers using one of the most heavily trafficked Japanese websites at the time. The problem persisted for months until we were able to consolidate on one floor only." (from my book on datacenter management and operations).
15 October
2004
A Tisket, A Tasket, I Lost My TCP Packet Part II
Doesn't Anyone Test Equipment Anymore?
Well, my Japanese datacenter manager story hit a bit of a nerve, with one reader asking "doesn't anyone test equipment anymore?" You're correct. This was the first question in this incident. Didn't anybody test anything? Yes, they did, as did the datacenter. Here's the continuing saga of A Tisket, a Tasket, I've Lost My TCP Packet direct from that datacenter manager.
"Never got a straight answer (this is Japan), but we believe that it appeared fine on initial use, then degraded. The supposition among the datacenter support staff for the company (not the Ariake datacenter staff) was a defect that resulted in a soft error (like perhaps a weak component). Power off / power on - fine for a while. But as soon as we got under serious load, poof.
So it hit the first load day. And thereafter.
In the US, this would be sufficient grounds for replacement (once is enough), but because it appeared sporadic, and because the Ariake staff had already tested it and it was fine, they would not replace it. They presumed we were increasing their costs to force a cancellation of a contract (there was no intent to do this, but you can't get inside people's heads and argue with their fears easily). It's partly a cultural issue - the Japanese are very good on hard error situations, but don't take well to soft / sporadic situations in a damage control society because someone's going to get blamed, and that ruins careers.
In sum, they couldn't perceive it as damaging the quality. They in truth did not have the correct services in place to move a client from one floor of a datacenter to another (they were expanding) and the supplied LAN service was marginal for use on the interior of the 3 tier datacenter (like app server to database, because retransmission didn't really cost you much), but the problem was they used it on tier 1 and that did make a major difference.
Sometimes you run into social situations where people get absolutely fixed on believing they have an adequate situation. It was failing - but by then everyone was fixated on CYA. Yet the problem was very real. The site was one of the most popular in Japan. But it still was in Japan, and one has to play by the cards dealt - not the ones you'd prefer.
That's when the managers like me really earn their pay".
21 October
2004
IP Dogma and Cognitive Networks at the Stanford Networking Research Center
A chat with Shannon Lake
I came across a scheduled talk at Stanford Networking Research Center this week on "Cognitive Networks: Implementing Alternate Network Management & Routing with Software Programmable Intelligent Networks" by Shannon Lake, CEO of Omivergent. Unfortunately, I had another seminar to attend at exactly the same time (as usual), but I was curious about this talk and Mr. Lake's assertions on "IP dogma". So I went and asked him why we need to "change our views on IP networks", layer 3 versus layer 4, and the impact of jitter. He most kindly replied.
SL: "Wow, where do I start. There are quite a few things that we, as an industry, believe are taboo. One of the big issues with IP has to do with business model and the business models available in an IP based network. I am going to be focusing on the realities of IP such as what it means to be connectionless and stateless, with decentralized control logic. The economic implications are staggering not to mention put all signaling in-band and you have one big insecure, economic mess."
LGJ: "Do you deal with layer 4 at all in your analysis, or do you keep discussion at layer 3?"
SL: "On Layer 4… networks act differently layer 3 and below than layer 4 and above. Connectionless and connection-oriented mean different things in these two realms. I fundamentally believe that there are 3 planes – a transport plane, switching plane, and a management plane. The network can be abstracted into these 3 planes and then managed based on connection-oriented and connectionless transport (plane) operation. Fundamentally, I believe that rather than overlaying control, we must underlay control. (This topic requires quite a bit more discussion and it is not this simple, but it is a different way to look at networks). This yields much more determinism in networks along with the availability of new economic models. To answer the question, I will focus mostly on layer 3, but I will also get into tying layer 4 into layer 3 functions."
LGJ: "And finally, are aspects of jitter on the network of relevance, given your telco focus?"
SL: "Issues on jitter… we go way beyond Telco and jitter... some of the network modeling has been done for wireless ad-hoc infrastructure and yielded significant reductions in overhead. As for jitter we use a classification called HoS or Heuristics of Service to categorize a circuit, flow, user, event, etc. by jitter, delay, latency, cost, location, carrier, security level, etc."
LGJ: "Anything else to add?"
SL: "Wow, pointers… I have taken on discussing some of these issues online. We have many issues with performance, and security along with the top 3 layers of the models Layers 8, 9, and 10; the religious, economics, and religious layers. For instance, you cannot secure a flow if the header information contains all the information you need to know who is sending the header and who is receiving it. This alone opens up a network for interception, replay, DDoS, masquerading, man-in-the-middle attacks and the list goes on... I have countless examples of ways in which we as an industry turn a blind eye because we fundamentally believe that the IP network is the cure-all solution. Too bad you will miss the presentation."
Well, I wish I could have made it, but for me and everyone else who couldn't here is a pdf of Mr. Lake's presentation at SNRC. I hope you all enjoy it. And thank you Mr. Lake for your time.
23 January
2005
Mediapost Cyberbullying 101 - "Everyone's Doing It"
Lynne Jolitz - tech adopters just led the way
Mediapost published an essay on cyberbullying - "Cyberbullying has suddenly entered into popular consciousness." So it's a new phenomenon, right? Nope, it's been around as long as electronic communications made it possible. It just wasn't as visible since there were fewer channels of communications, plus if someone acted up you could get them thrown off. Now, in a global Internet, there are plenty of places to hide and plenty of eyeballs for venom - you just gotta know where to look.
10 February
2005
Weird Windows XP TCP Behavior
Why wait for the window - just do it?
Sam Jansen of Wand Network Research Group in New Zealand recently complained of "weirdness" with Windows XP (SP2) and TCP when doing a TCP test of two systems connected over a link (very similar to those demos at InterProphet). After all, what could be simpler than a couple of cans and a string, right?
No, it isn't simple. He finds that Windows is sending data outside of the receiver's advertised window, as well as sending "weird sized" packets in what is supposed to be a "simple bulk data transfer (often sending packets with a lot less than an MSS worth of data)". What's going on here?
Poor Mr. Jansen is not losing his mind - what he is seeing is real and Microsoft is cheating. We saw lots of little cheats like this when we were testing on Windows for SiliconTCP and EtherSAN back in 1998 to the present. In sum, Microsoft does this because they think they "get ahead" with a technique called "oversending". It thrives because TCP congestion control algorithms are all pessimistic with the send budget. It doesn't always work, like any cheat, but I guess it makes them feel good.
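If you want to check your own traces for this, a minimal Python sketch of the test is below. It works on relative sequence numbers you have already pulled out of a capture, and it ignores sequence wraparound and window scaling; the type and function names are mine, not from any capture tool.

```python
# Flag segments that extend past the receiver's advertised window.
# ASSUMPTIONS (illustrative only): relative sequence numbers, no wraparound,
# window already scaled to bytes, hypothetical names throughout.

from typing import List, NamedTuple


class Segment(NamedTuple):
    seq: int     # sequence number of the first payload byte
    length: int  # payload bytes in the segment


def outside_window(segments: List[Segment], last_ack: int, rcv_window: int) -> List[Segment]:
    """Return segments whose last byte lies beyond last_ack + rcv_window."""
    right_edge = last_ack + rcv_window
    return [s for s in segments if s.seq + s.length > right_edge]


if __name__ == "__main__":
    sent = [Segment(1000, 1460), Segment(2460, 1460), Segment(3920, 1460)]
    # The receiver has ACKed up to 1000 and advertised room for two segments.
    print(outside_window(sent, last_ack=1000, rcv_window=2920))
```

Anything the function returns was sent before the receiver said it had room for it, which is the behavior described above.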
04 April
2005
Checksums - Don't Leave the Server Without Them
UDP, performance gains, and datacenter horror stories from Sun and NetApps
Lloyd Wood commenting on an e2e post recently was asked why UDP has an end-to-end checksum on the packet since it doesn't do retransmissions, and should it be turned off. Lloyd noted UDP "could have the checksum turned off, which proved disastrous for a number of applications, subtly corrupted filing systems which didn't have higher-level end2end checks". Lloyd is exactly right here. But why would someone turn off UDP checksums in the first place - it doesn't seem to make sense, does it?
It is often the case that people turn off UDP checksums to "buy" more performance by relying on the CRC of the ethernet packet. So this is not a stupid question - it's a very smart question, and a lot of smart people get fooled by the simplicity of the process. Performance gain by turning off checksums now can be obviated through the use of intelligent NIC technologies like SiliconTCP and TOE that calculate the checksum as the packet is being received.
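To make the mechanism concrete, here is a small sketch of the 16-bit ones' complement checksum that UDP and TCP use (the algorithm described in RFC 1071). It is a simplification: real UDP also sums a pseudo-header of addresses, protocol and length, which I leave out, and the payload string is made up for the example.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum over data, per RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


payload = b"NFS write block 42"  # hypothetical payload, even length
cksum = internet_checksum(payload)

# Simulate a bad memory cell on the NIC flipping one bit after the
# ethernet CRC has already been checked and stripped.
corrupted = bytearray(payload)
corrupted[4] ^= 0x08
corrupted = bytes(corrupted)

detected = internet_checksum(corrupted) != cksum
# The receiver's view: data plus its checksum sums to zero when intact.
intact = internet_checksum(payload + cksum.to_bytes(2, "big")) == 0
```

The point of the corrupted case is that the damage happens after the link-layer CRC has done its job, which is why relying on the ethernet CRC alone is not enough.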
This is a surprisingly common problem in datacenters - sometimes the problem would be a switch, sometimes a configuration error, sometimes a programming error in the application, and so forth. I most recently experienced this problem with an overheated ethernet switch passing video on an internal network. Since we don't have things like SiliconTCP in commodity switches yet, check that switch if you're having problems. In the meantime, here's a few little datacenter horror stories to put in your pocket.
The Sun datacenter back in the early 1990's had an NFS cluster project called Sunbox - an array of workstation CPUs that did divide and conquer to build a massive file server. It used an ethernet multiplexer to dynamically split the load. To buy back performance, they turned off the UDP checksum. It worked fine until they had a bad lot of ethernet boards with substandard memories - this wasn't picked up in tests because the test units were doing resends of the occasionally corrupted packets (UDP checksums were usually turned on), and in TCP the checksums would do resends as well. It was also a fairly rare problem, and the test periods were too short to pick up on the nature of this problem easily.
But when UDP checksums were turned off in normal use, the resulting NFS requests were corrupting the filesystem (which in this case was database files), forcing rebuilds and manual repairs of database tables.
As they were about to announce and release it, they suddenly discovered this problem - they noticed the corruption and in order to determine whether it was in the high level (stack or above) or lower levels, they turned on checksums and it worked immediately.
They then examined the packets that failed the checksum to trace back down the lower-level stack through the link layer and discover where the corruption occurred. With logic analyzers, they were able to observe that the contents going into memory from the NIC on reception were different from the contents coming out of memory and traveling across the bus to the processor.
I also ran into this at an Internet portal company where I was a manager. We were using NetApps file servers to mirror the daily information - NetApps at the time encouraged staff to turn off checksums to increase performance. The DBAs noticed problems and ended up doing frequent rebuilds, but couldn't figure out why. It took me a lot of time to convince my staff to turn on the checksums because they were told "they don't have to" by NetApps. Most datacenter staff work by cookbook, and this wasn't in the cookbook. When they finally tried it, it worked. This little problem cost us a lot of time and aggravation for very little (if any) performance gain.
Higher level checksums are worth it every time. Don't leave the server without them.
12 April
2005
Checksums and Rethinking Old Optimization Habits
From BBN and NFS to space communications and data error
More war stories on checksum failures over the years. Craig Partridge recalls "some part of BBN" experienced an NFS checksum issue and that it "took a while for the corruption of the filesystem to become visible...errors are infrequent enough that NIC (or switch, or whatever, ...) testing doesn't typically catch them. So bit rot is slow and subtle -- and when you find it, much has been trashed (especially if one ignores early warning signs, such as large compilations occasionally failing with unrepeatable loading / compilation errors)". Craig is absolutely right - this was exactly the case with the Sunbox project I described as well as the datacenter mirror example (see Checksums - Don't Leave the Server Without Them). Too much damage too late. As implicit dependence on reliability increases, the value of checksums becomes very clear.
In the early deep space probes they learned the hard way the importance of always providing enough redundancy and error correction, because a single bit error might be the one that leads to the destruction of the communications ability of the spacecraft. One spacecraft had a corruption error like this that destroyed it for precisely this reason. They optimized out reliability to get a slightly greater data rate, and lost the spacecraft (this has happened more than once).
We're reaching a point where we have to seriously think about whether an "optimization" is really valuable, since as Craig notes, you may not notice a problem until too late. In this age of ubiquitous computing, with plentiful processor, memory, and network bandwidth, we should be focussed on increased reliability and integrity, but old habits of a more parsimonious age die hard.
Another very recent example of ignoring the value of checksums is reflected in the recent 'fasttrack' problems of incorrect billing of tolls. But that's another story...
13 June
2005
ACM, Turing and the Internet
Vint Cerf and Bob Kahn feted and awarded
Vint Cerf and Bob Kahn got a well-deserved dinner and party in San Francisco courtesy of the ACM. A collection of Internet "who's whos", lots of wine and speeches, and most importantly, their coveted Turing award. This award was announced several months ago. As Vint noted in an email reply to the Internet Society a month ago (try to take notes during an awards dinner - it can't be done), "What is most satisfying about the Turing Award is that it is the first time this award has recognized contributions to computer networking. Bob and I hope that this will open the award to recognize many others who have contributed so much to the development and continued evolution and use of the Internet."
So congratulations to Vint and Bob. I'm sure we are all very pleased that they have been honored with the Turing Award this year. They both deserve it - their work has changed our world!
27 June
2005
Jitter, Jitter Everywhere, But Nary a Packet to Keep
Jitter isn't important - until you want to use streaming rich media (audio/video)
I was looking over the end-to-end discussion on measuring jitter on voice calls on the backbone and came across this little gem: "Jitter - or more precise delay variance - is not important. Only the distribution is relevant". This dismissive little item of a serious subject is all too commonplace, but misses the point the other researcher was making.
The critic assumes "fixed playout buffer lengths (e.g. from 20 to 200ms)" to calculate overall delay. But do these buffer lengths take into account compressed versus uncompressed audio? If not, the model is faulty right there. The author admits his approach is "problematic" but then assumes that "real-time adaptive playout scheduling" would be better - but then the measurement mechanism becomes part of the measurement, and you end up measuring this instead of the unmodified delay - which doesn't help the researcher looking at jitter and delay measurements for voice.
But there is a more fundamental disconnect here between our voice-jitter researcher and his jitter-is-irrelevant nemesis - jitter does matter for some communications - it just depends on what problem is being solved. And it is a careful definition of the problem that leads to dialogue.
There are two modes of thought here. One is that if the packet arrives a little earlier, then a little later, then we're all OK because in the grand scheme of things the "overall delay" is what matters - not the little perturbations. So in this case, buffering alone is good enough because it is elastic enough to recover. These guys work a "classical" model in the sense that perturbation theory is just a math game and instead view the variations within defined limits as "cancelling out".
But what if perturbations like jitter do matter? (It helps, I suppose, to be trained in physics instead of statistics - most of the Nobel work in the last 100 years has been based on analyzing perturbation.) The problem with our classical model lurks in the underlying assumption that the buffer is elastic enough to recover because we are dealing with a bimodal distribution. But what if the jitter isn't bimodal? You could get successive lags, for example, on a video transmission that results in decreased quality as the bandwidth expires due to congestion backoff. So jitter can have real-world results.
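A toy illustration of the difference, with numbers I made up rather than measured: two delay traces with the same average delay, one alternating politely between two values and one with a run of successive lags. I count packets that miss a fixed 100 ms playout deadline and, for comparison, compute the smoothed interarrival jitter estimate from RTP (RFC 3550), treating the one-way delays as transit times.

```python
def late_packets(delays_ms, playout_ms):
    """Count packets arriving after a fixed playout deadline."""
    return sum(1 for d in delays_ms if d > playout_ms)


def rfc3550_jitter(delays_ms):
    """Smoothed interarrival jitter, J += (|D| - J) / 16, as in RFC 3550."""
    j = 0.0
    for prev, cur in zip(delays_ms, delays_ms[1:]):
        j += (abs(cur - prev) - j) / 16.0
    return j


def mean(xs):
    return sum(xs) / len(xs)


# Hypothetical one-way delays in milliseconds; both traces average 60 ms.
steady = [50, 70] * 10            # "nice" two-valued jitter
bursty = [40] * 16 + [140] * 4    # successive lags at the end

PLAYOUT_MS = 100  # assumed fixed playout buffer

steady_late = late_packets(steady, PLAYOUT_MS)
bursty_late = late_packets(bursty, PLAYOUT_MS)
steady_j = rfc3550_jitter(steady)
bursty_j = rfc3550_jitter(bursty)
```

On these traces the steady stream loses nothing while the bursty one drops its last four packets, even though the averages match and the smoothed jitter figure for the bursty trace comes out lower. A summary statistic that looks in spec can hide exactly the perturbation that hurts.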
In effect, with a jitter distribution that isn't "nice" (e.g. bimodal), any work done at the beginning of a session to choose the right video stream based on scale, size, quality for the bandwidth to support it is now wrong. I suppose we can try to compensate for this, for example, by taking our video stream set up for 300 kbps and overcompressing it to 220-240 kbps to deal with any "droop" in bandwidth to avoid taking the hit on quality.
The problem with doing this is that you have pre-degraded the video stream to ensure you get through without degradation, which seems a bit self-defeating. Plus, what about the obvious issue of synch and voice lag? If substantial variation - jitter - occurs outside of the compression block for audio information, the decoder can't isochronously decode to audio because, in effect, the elastic buffer is working off of compressed audio while your ear is working off an uncompressed stream algorithm. And yes, what the listener hears is fundamentally the definition of quality.
Between these two different "algorithms" (your ears and an elastic buffer), the dissonance (synch error) can result in a pop. These are psychoacoustic effects that occur that degrade the quality of experience for the person on the other end of the line. They are very real, and jitter is the cause. But because the average variance isn't out of spec, no one will find a problem with the data transfer itself because they aren't looking for jitter.
So our jitter-voice researcher is absolutely correct in his hunt for more precision on perturbation. Elasticity is a basic concern for rich media (voice / video). Our jitter-is-irrelevant critic is also right in assuming that jitter doesn't matter for other types of communication like email or bulk file downloads that await completion. But like the physical world, we have a quantum realm that creates lots of wondrous effects. One should never be dismissive of subtlety.
29 June
2005
TCP Protocols and Unfair Advantage - Being the Ultimate Pig on the Bandwidth Block
Benchmark test results of ScalableTCP, FastTCP, HSTCP, BICTCP thwart RTT fairness policies
Little item from the testing side of proposed TCP protocols on stack fairness from the Hamilton Institute at the National University of Ireland, Maynooth.
According to Douglas Leith:
"In summary, we find that both Scalable-TCP and FAST-TCP consistently exhibit substantial unfairness, even when competing flows share identical network path characteristics. Scalable-TCP, HS-TCP, FAST-TCP and BIC-TCP all exhibit much greater RTT unfairness than does standard TCP, to the extent that long RTT flows may be completely starved of bandwidth. Scalable-TCP, HS-TCP and BIC-TCP all exhibit slow convergence and sustained unfairness following changes in network conditions such as the start-up of a new flow. FAST-TCP exhibits complex convergence behaviour."
What's this mean? Simple. In order to get more for themselves these approaches starve everyone else - the "pig at the trough" mentality. But what might work for a single flow in a carefully contrived test rig can immediately start to backfire once more complex "real world" flows are introduced.
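Why would a more aggressive increase rule hurt fairness? A deliberately crude sketch: two flows share one link, both see a loss whenever their combined window exceeds capacity, and each then applies its own rule. The AIMD constants are standard TCP's; the multiplicative ones (grow 1% per round, cut to 0.875 on loss) are my assumption, loosely modeled on the Scalable-TCP proposal. This is not the Hamilton Institute test setup, which used real stacks and real links.

```python
def simulate(increase, decrease, w1, w2, capacity=100.0, rounds=2000):
    """Two flows on one link with synchronized losses; returns final windows."""
    for _ in range(rounds):
        if w1 + w2 > capacity:          # link overloaded: both flows see loss
            w1, w2 = decrease(w1), decrease(w2)
        else:                           # otherwise both probe for bandwidth
            w1, w2 = increase(w1), increase(w2)
    return w1, w2


def jain_index(windows):
    """Jain's fairness index: 1.0 is perfectly fair, 1/n is one flow hogging."""
    return sum(windows) ** 2 / (len(windows) * sum(w * w for w in windows))


# Standard TCP: additive increase, multiplicative decrease.
aimd = simulate(lambda w: w + 1.0, lambda w: w / 2.0, 80.0, 10.0)
# Scalable-style: multiplicative increase, multiplicative decrease (assumed constants).
mimd = simulate(lambda w: w * 1.01, lambda w: w * 0.875, 80.0, 10.0)

aimd_fairness = jain_index(aimd)
mimd_fairness = jain_index(mimd)
```

Under AIMD the gap between the flows halves at every loss, so the 80/10 starting split converges to an even share. Under the multiplicative rule the 8:1 ratio never changes, so whoever got there first keeps the trough.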
There have been concerns for years that these approaches could wreak havoc on the Internet if not carefully vetted. I'm pleased to see someone actually is testing these proposed protocols for unfairness and the impact on network traffic. After 30 years of tuning the Internet, taking a hammer to it protocol-wise isn't just bad science - it's bad global policy.
06 March
2006
Why Keep Alive "KeepAlive"?
TCP security flaws in an insecure age
Keepalive in TCP has always been controversial, since it blurs the difference between a dead connection and a moribund connection, or as Vadim Antonov puts it "the knowledge that connectivity is lost". Advocates, in contrast, believe that the effort reclaiming resources needn't be done and hence as David Reed puts it "there is no reason why a TCP connection should EVER time out merely because no one sends a packet over it." Antonov expresses a very narrow affirmation of the value of retained state which is not necessarily useful in the time required, while Reed expresses the reductionist philosophy that no effort should be expended without justification even if the basis of the repudiation is inherently faulty. But is either truly getting to the heart of the issue? Is it truly important to cleave to the historical constraints of the past Internet philosophical design? Or should we consider the question in the context of what is relevant to the Internet today?
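For readers who have never touched the mechanism being argued over, this is roughly what switching it on looks like from an application. SO_KEEPALIVE is the portable socket option; the three tuning knobs are platform-specific, so the sketch only sets the ones the local Python build exposes. The helper name and the timer values are arbitrary choices of mine, not recommendations.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on TCP keepalive probes for sock, tuning timers where supported."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These constants are not defined on every platform, so look them up defensively.
    for name, value in (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # unanswered probes before giving up
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

Note that nothing here authenticates the peer. The probes only say that something at the other end still answers for the connection.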
I don't ask these questions frivolously, but with a serious intent. While I am a student of history, and find the study of heritage very valuable in technical work (even patents require a love of reasoning over time), we should occasionally look at the world not as we would like it to be or how it was but how it is. Thus, I suspect the question should actually be "What is the point of having a long-lived TCP session with keepalive in the 21st century"? Is this not a security hole ripe for exploitation in an age of ubiquitous bandwidth and zombie machines? Is not the lack of security and credentials in the modern Internet the bane of both service providers and users? This is the heart of the issue.
The resource allocated to build a socket is minimal if well-designed in a processor / memory rich environment. Fears about OS resource allocation should not be considered as justification for putting keepalive in transport - that's a red herring, about as silly as libtelnet.
Just as in operating systems, the silly debate over a minor mechanism is awash in nostalgia (since the justifications on either side lack conviction) and is not a matter of forward-looking design. As long as design is a matter of redoing the past, keepalive will live on because no one need justify its existence. But good design does not demand the use of keepalive and its use bogs us down dealing with issues of security and operation of the Internet. That doesn't sound like a good design trade-off - does it to you?
28 September
2006
The Minutiae of Getting a Flash Video to Play Right Every Time
Video, router timeouts, and stale caches
OK - you've got it all together. The video is ready to download and play, it's tested, we've watched it, the flash works (or Quicktime or whatever vintage you prefer). We watch customers watch it over and over. Things are going great. Then, someone somewhere tries to download it over the web, and it fails. The refresh button is hit over and over, it continues to fail, and that disappointed person just gives up. Why didn't it play?
Looking over the logs today provides a window into just how difficult it is to provide 24/7 perfect video streaming to any type of computer anywhere. These problems vex the biggest and smallest vendor because they are based on architectural flaws so fundamental that these occasional failures are impossible to guard against.
So why did this one customer not get the same video experience everyone else did? Well, it was really just the luck of the Internet router draw.
It first failed because the session created from client to server did not have enough persistence (yet) to ensure that subsequent content transfer could occur before the flash application timed out.
One common reason is that a distribution router in the server's ISP installed an entry for that particular packet flow that was deleted almost immediately after creation because of too much transient load - too many streams at that point in time. The allocator handed out an entry for the session on the initial SYN, but that entry only lived for a packet or two before it was deallocated.
Subsequent retransmission ran into a different problem - no entries could be created and the client timed out with a partial content transfer which was not enough to play.
What happened on the client's refresh when it didn't play? The client's refresh refused to check the entry and kept using the local cached short file insufficient for playback. After hitting refresh 9 times, the viewer gave up and watched another flash movie from the same site using the same player - and this time it played perfectly! It worked because by this point in time enough activity was associated with the routing entry so it could not be easily purged.
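The router side of this can be pictured with a toy flow table. This is purely a hypothetical model, not any vendor's allocator: the table has a fixed number of entries, and under pressure it purges whichever flow has seen the least traffic.

```python
class FlowTable:
    """Toy fixed-size flow table that evicts the least-active flow when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.hits = {}  # flow id -> packets seen so far

    def packet(self, flow):
        """Record one packet for flow, evicting a quiet entry if needed."""
        if flow not in self.hits and len(self.hits) >= self.capacity:
            victim = min(self.hits, key=self.hits.get)  # least activity loses
            del self.hits[victim]
        self.hits[flow] = self.hits.get(flow, 0) + 1


def busy_table():
    """A three-entry table already carrying two established flows."""
    table = FlowTable(capacity=3)
    for _ in range(10):
        table.packet("a")
        table.packet("b")
    return table


# First attempt: the video flow has only its SYN on record when load spikes.
first = busy_table()
first.packet("video")
first.packet("transient")            # new stream arrives, table is full
first_try_survives = "video" in first.hits

# Later attempt: by now plenty of packets are associated with the entry.
later = busy_table()
for _ in range(20):
    later.packet("video")
later.packet("transient")
later_try_survives = "video" in later.hits
```

In this model the brand-new entry is the first thing thrown overboard, while the same flow with some history behind it rides out the same spike.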
The most significant flaw was the browser's stale cache. The second flaw was not providing the initial route entry with enough lifetime for the TCP flow control mechanism to do its job.
The moral of the story - with the built-in fragility of transient conditions coupled with poor design and the need to transfer larger files like video - "You can't please all of the people all of the time".
21 May
2007
Boom and Bust in IP Address Space Land
ARIN no longer allocating IPv4 protocol IP address blocks - Is this a turning point for IPv6?
Dave Reed on e2e notes a very interesting item - ARIN has announced that migration to IPv6 is now mandatory for allocation of contiguous IP address space. "I still remember debating variable length addressing and source routing in the 1970's TCP design days, and being told that 4 Thousand Million addresses would be enough for the life of the Internet" Dave crows. But is this an accurate "read"? (I know Dave won't mind the pun, as he's heard it many times before).
As I commented on e2e, I remember that debate as well. But the whole genesis of why 32 bits was good enough was an (underjustified) view on the use of networks rather than an understanding of how sparse addresses were actually employed. Everybody knows hash tables work best mostly empty - the same may be true with address blocks because they are allocated in routable units. But how does this really work?
The presumption of over 200 (254, or 252 for the annoyingly picky) Class A networks, each with about 16 million hosts (16*1024*1024-2 for the pointlessly obsessive) was the most hilarious, because nobody could explain how you could deal with a single network with 16 million hosts, much less some 200 of them. The claim was that a phone company using X.25 PADs might have 16 million subscribers connected in an odd configuration, but it was never to my knowledge deployed because of cost considerations. Actually, when it became possible with ISDN, it wasn't considered desirable by Pac Bell (Scott Adams was still in San Ramon then, BTW!) either as a business or technically (too much of a load for them to feel comfortable).
The counterpush at the time was for 64-bit object identifiers (for an unrelated project) - a ludicrously overblown number. For fun Ross Harvey calculated that 2^64 printed punchcards stacked one on the other would reach farther than the earth-sun distance. So one could go overboard in the opposite direction to little real purpose as well. I include this little item for those who like to test the waters by tossing in the baby.
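For the pointlessly obsessive, the arithmetic in the last two paragraphs is easy to check. The 10.0.0.0 prefixes are just convenient examples, and the punchcard thickness (0.007 inch) is my assumption about a standard card, not a figure from Ross's calculation.

```python
import ipaddress

# 32-bit addressing: the "4 Thousand Million" figure.
total_v4 = 2 ** 32

# One network with an 8-bit prefix, minus the all-zeros and all-ones host values.
net8 = ipaddress.ip_network("10.0.0.0/8")
hosts_per_net8 = net8.num_addresses - 2

# A /19, the block size startups were often told to budget for.
net19 = ipaddress.ip_network("10.0.0.0/19")
addresses_in_19 = net19.num_addresses

# 2^64 punchcards stacked flat versus the earth-sun distance.
CARD_THICKNESS_M = 0.007 * 0.0254   # assumed 0.007 inch per card, in meters
EARTH_SUN_M = 1.496e11              # roughly one astronomical unit
stack_m = 2 ** 64 * CARD_THICKNESS_M
stack_in_au = stack_m / EARTH_SUN_M
```

With that card thickness the stack comes out at thousands of times the earth-sun distance, so "farther than" is putting it mildly.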
So what this bulletin claims is that we are now out of sparse space and into dense space, i.e. if one did a fractal map of the roughly 4 billion address IPv4 space it would have little unallocated. If true, this is an interesting juncture for ARIN and the IPv6 community. But is it true?
The concerns expressed over the exhaustion of IPv4 address space are similar to the concerns expressed over the exhaustion of telephone numbers. The assumption was that everyone had to have multiple cell phone numbers plus land lines plus separate computer lines and so forth, so the estimates ran from 4-10 lines per person. Area codes were split to accommodate new growth, and the press began to run stories about how we would run out of telephone numbers, contributing to a general hysteria. Companies began to over-order on blocks of phone numbers to "build-in" room for growth on their switches. In one Internet portal company where I managed their datacenter, I also had to budget this item, even though I noticed my staff was increasingly using mobile devices and advocated a single number to mobile redirect policy. This number peaked in the late 1990's and has since fallen as technologies were developed to combine voice / data, and land lines are wholesale abandoned for purely mobile devices.
Like the overbuying of phone lines of the 1990's, startups are often encouraged to budget for /19's even though the number of IP addresses actually used is very small, because the security demands of monitoring and securing open ports of such a large number of IPs overwhelm the IT staff, who in turn go to NAT (sometimes they really go overboard and reduce too much). As the IPv4 address space is used up and grows more expensive, perhaps there will be a similar collapse, where there are plenty of scattered small blocks which can be bartered among service providers. In this case, the ARIN announcement may simply be the peak before the drop.
So this is also an exciting opportunity for anyone who likes to make wagers. Is this the peak before the drop, like the phone number exhaustion hysteria of the 1990's, or is it something much deeper? Since even Dave notes that nobody has that many addresses in use already (sorry e2e - NAT is here to stay), yet ARIN is claiming otherwise, we may simply be seeing the end of an era of IP address block hoarding and not the beginning of a new address space boom.