While nothing is definitive yet, and Skype is still being quite vague about it, the best information suggests that the root cause of Skype’s outage is the P2P aspect of the network, not the notorious “central servers”.
My previous post questioned the “common wisdom” that P2P services are generally more outage resistant. Even well-respected experts can fall into this trap. Yesterday, Tom Evslin said:
P2P services are generally more outage resistance [sic] than services which depend on a centralized bank of servers…
It appears that the cause of Skype’s outage is specifically their P2P design and not, as most have guessed, a problem with their “central servers”. I remember theorizing about this several years ago, well before eBay purchased Skype. Several colleagues and I had speculated about a potential cascading-failure pathology in the DHT (distributed hash table) and supernode model. It seemed possible that the system could fall off a cliff if a certain percentage of supernodes became unstable. If that happens, it can take a long time for the network to sync up and become stable again.
It’s not clear that this is exactly what is happening, but it appears to be something of this sort. Basically, Skype depends on selected end users’ computers to act as database servers and perform other network services (“supernodes”). In order for the network to operate, a certain number of these supernodes must be active on the network. Given that these computers are in the hands of end users and not in any way under Skype’s physical control, they can go off-line at any time. These supernode boxes also serve as Skype clients for their owners, so once Skype “hits the wall” and a box starts acting funny, its user might reboot the machine or restart or shut down the Skype application. That’s one less supernode on the network… and so on, and so on. It could theoretically take DAYS for this pathology to end.
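To make the “fall off a cliff” idea concrete, here is a toy sketch in Python. It is not a model of Skype’s actual protocol, which is not public; the function name, the parameters (`quit_fraction`, `rejoin_fraction`, `capacity_per_node`) and every number in it are assumptions I made up purely for illustration. The only thing it captures is the feedback loop described above: overloaded supernodes get restarted by their owners, which pushes more load onto the ones that remain.

```python
def simulate_cascade(total_supernodes, initial_offline, clients,
                     capacity_per_node, quit_fraction=0.3,
                     rejoin_fraction=0.05, steps=50):
    """Return the number of active supernodes after each step (toy model).

    All parameters are hypothetical. Each step:
      * if the average load per active supernode exceeds capacity_per_node,
        a quit_fraction of active nodes are restarted by annoyed users;
      * a rejoin_fraction of the offline nodes come back online.
    """
    active = total_supernodes - initial_offline
    history = [active]
    for _ in range(steps):
        # Average number of clients each remaining supernode must serve.
        load = clients / active if active else float("inf")
        overloaded = load > capacity_per_node
        leaving = int(active * quit_fraction) if overloaded else 0
        offline = total_supernodes - active
        rejoining = int(offline * rejoin_fraction)
        active = active - leaving + rejoining
        history.append(active)
    return history


if __name__ == "__main__":
    # 90,000 clients at 100 per node need at least 900 active supernodes.
    small_dip = simulate_cascade(1000, 50, clients=90_000, capacity_per_node=100)
    big_dip = simulate_cascade(1000, 150, clients=90_000, capacity_per_node=100)
    print("after small dip:", small_dip[0], "->", small_dip[-1])
    print("after big dip:  ", big_dip[0], "->", big_dip[-1])
```

With these made-up numbers, a dip that leaves the network above the capacity threshold slowly heals, while a dip that crosses it keeps shedding supernodes and settles far below where it started. The threshold behavior is the point, not the specific values.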
I have mentioned this fundamental problem with Skype specifically in several contexts in the past. In response to the argument that users can protect themselves from becoming supernodes, I said:
I believe Skype is not viable if everyone (literally) takes your advice and connects their PC behind a firewall. Us “smart ones” behind firewalls depend on the “dumb ones” not behind firewalls to route our Skype packets for us. Hence we are back to my original question.
As a counter to the “Skype proves that NAT is not a problem” position, I said:
the Skype NAT hacks depend on super-nodes, and ultimately, nodes that are not behind NAT. If every Skype node were behind a restrictive NAT, the whole Skype network would fail. Skype would have to deploy NAT-free servers of their own.
Many people have countered that Skype does deploy supernodes of their own. This outage suggests that this may not be the case, or that Skype does not deploy ENOUGH such supernodes to ensure the stability of their network.
I’m not saying that P2P is “bad” or that centralized servers are “good”. I’m merely suggesting that things are seldom black and white. Building reliable large-scale systems depends on a lot more than whether they use P2P or not.