Reliability in the cloud: hidden dependencies

The much-discussed outage of Skype last month, and its eventual attribution in part to Windows Update (which itself was functioning exactly as designed), leads to a number of interesting observations.

  • Distributed peer-to-peer systems were heralded for their reliability, owing to the lack of any single point of failure (SPOF). Skype routes calls using a P2P network of machines owned by its own members (although authentication is centralized), and if there is anything in plentiful supply on the Net, it’s machines with idle time and bandwidth. The outage suggested that the parts were not quite as loosely coupled as classic distributed systems theory would have one believe: they could fail in a quite coordinated manner because they all sport the identical configuration.
  • Diversity fans will probably jump at the occasion to point out the evils of software uniformity. If some larger fraction of the Skype clients were running Linux or Mac, they would not all have rebooted at the same time and the outage would have been avoided, the argument runs. But this is unlikely to make a quantitative difference, as even an egalitarian market divided three ways between Windows, Linux and Mac would have a substantial number of nodes of each variant. Also, it is possible to get diversity of behavior on the cheap without diversity of platform: in this case, randomly spreading apart the patch installation and reboot times will do the trick.
  • This was a completely unexpected interaction between two cloud services, one for VoIP and one for software distribution. It’s taken for granted that two client applications installed on the same machine can have allergic reactions and blow up the machine. (This is the well-known DLL hell problem in Windows.) But the dramatic demonstration that Windows Update (WU), a service hosted “out there” in the cloud, could impact another completely independent service hosted elsewhere is news.
  • What does this mean for those engineering services? Keeping Skype up and running was not in the design criteria for WU. If anything, the goal of getting security patches out to vulnerable machines ASAP would have increased the pressure on Skype by rebooting all machines quickly. A lot of software these days has auto-update capability, mostly poorly designed and not even giving users a chance to consent. It’s not a stretch to assume that one could initiate a forced reboot of most Windows or Mac machines in quick succession. Is Skype at fault, then, for depending on the uptime of machines that it has no control over? The architects did the right thing and hedged their bets statistically by requiring only some fraction of their nodes to be operational. Is that more of a gamble than building a giant data center stacked with wall-to-wall racks of servers? (It’s certainly cheaper and more efficient, and environmentally friendly considering the power usage of the modern data center. And redundancy would have required multiple DCs, geo-located around the world.) Until now the gamble paid off, but one day WU pushed the system beyond its critical threshold.
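
The point about spreading reboots apart in time, and about a system that only needs some fraction of its nodes up, can be made concrete with a toy simulation. This is only a sketch under invented assumptions: the function name, the node count, the 5-minute reboot and the 24-hour window are illustrative and are not taken from how Skype or Windows Update actually behave.

```python
import random

def peak_offline_fraction(num_nodes, reboot_duration, jitter_window, seed=0):
    """Largest fraction of nodes that are offline at the same moment.

    Hypothetical model: each node begins its reboot at a uniformly random
    offset in [0, jitter_window] and is offline for reboot_duration
    (both in the same time unit, e.g. minutes).
    """
    rng = random.Random(seed)
    events = []
    for _ in range(num_nodes):
        start = rng.uniform(0, jitter_window)
        events.append((start, +1))                    # node goes offline
        events.append((start + reboot_duration, -1))  # node comes back
    # Sweep through the events in time order, tracking how many are down.
    events.sort()
    offline = peak = 0
    for _, delta in events:
        offline += delta
        peak = max(peak, offline)
    return peak / num_nodes

# No jitter: every node reboots at the same instant.
print(peak_offline_fraction(1000, reboot_duration=5, jitter_window=0))
# Reboots spread at random over 24 hours (1440 minutes).
print(peak_offline_fraction(1000, reboot_duration=5, jitter_window=1440))
```

With no jitter the whole population is offline at once, which no "some fraction must be up" threshold can survive. With the reboots spread over a day, only a small share of nodes is ever down together, even though every machine runs the identical platform.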

cemp

Firefox and the IE tab: a lesson in inter-operability

During the browser wars of the late 1990s, Internet Explorer edged out Netscape Navigator as AOL’s choice of browser. Discounting any business reasons, many people attributed this to a design choice made by IE: it was componentizable. While the Department of Justice focused exclusively on the extent to which the browser itself could be decoupled from the operating system, developers worried more about whether the individual pieces (networking, HTML rendering, user interface, …) could be pulled apart, modified and reassembled into different configurations. This was key for AOL and many other vendors who wanted to create a branded web browser by slapping a different “skin” on top of the same pieces under the hood.

In fact, the ease of integration for IE helped spread web capability into all Microsoft applications. That Outlook could host HTML email, or that Excel could easily import tables from a web page using a point-and-click process (simplifying the problem of screen scraping, before the idea of the semantic web could take off), was not much of a stretch. But the fact that Visual Studio doubled as a full-fledged web browser suggested things were going too far, a contemporary twist on the joke that every application written at MIT grows until it can read email. Soon every application contained a tiny embedded browser somewhere, a tradition that continues today with MSN/Live Messenger, for example.

Firefox has co-opted this capability in a very interesting way, to its own benefit. The IE Tab extension for Firefox allows viewing one or more tabs as an embedded IE page by hosting “Trident” (aka mshtml.dll), the HTML rendering engine. The net result is an option for 100% compatibility with IE.

This is very interesting because interop in a competitive market always helps the small fish. In a world where network effects are significant, being compatible with the market leader is often essential for survival, and raising the cost of interoperability is very much in the interest of the market leader. This is why OpenOffice spends a lot of time trying to reverse engineer and implement Microsoft Office document formats, but not the other way around. Web browsers are not like telephones, instant messaging clients or productivity software in this way, because the concern is not about pairwise interop. But there is an indirect effect: when most users have Internet Explorer, most websites will be optimized for IE. This is a completely reasonable business decision for a vendor, not a conspiracy. When 90% of users are on platform Foo, the efficient web designer will focus his or her resources on making the service function well on platform Foo. There is a point where investing in the others is justified, but it will always take a back seat to the primary revenue generator. This is the benefit of enjoying the market-leading position: other software has to emulate the leader, not the other way around.

Granted, the existence of standards such as HTML, Cascading Style Sheets, DOM etc. simplifies this somewhat. In principle it is no longer a moving target, because all clients are trying to implement the same specification, instead of one group making things up as it goes along with everyone else a passenger along for the ride. In reality, this vision fails because:

  • The HTML standard gives plenty of leeway for conforming implementations to create different user experiences based on the same document.
  • Web browsers are far from perfect in their conformance to standards.
  • Each browser defines a different extensibility framework (Flash, Java, ActiveX controls etc.), some of which are highly dependent on the underlying platform, e.g. the operating system.

The outcome is compatibility problems. Large websites often have the resources and expertise to develop and run QA across platforms, so they have no problems. It is often the smaller-scale operations, hobbyist sites or niche-audience hangouts that have a spotty history of browser compatibility. By a stroke of luck they could work just as well in Firefox, Safari, Opera etc. Equally likely, they will have minor visual blemishes, be missing functionality or flat out not function at all. This is where the IE Tab add-on comes to the rescue for Firefox.

[continued]

cemp