Cloud scaling lessons from the recent Amazon cloud services failure
May 4, 2011
A couple of weeks ago, I wrote about the need for reliable, secure clouds in the midst of Amazon AWS’s massive outages, which affected a wide swath of technology startups, including some of our own portfolio companies. The outages will have lasting consequences for the product and development processes at startups, which will have to incorporate more risk management into their planning. Does this mean that technology companies looking to leverage the cloud to scale should rethink their reliance on Amazon AWS?
Probably not, because as Chris Evans at the StorageArchitect blog argued in his post, and as many others have since pointed out, one should always plan for eventual outages of such a vast, complex infrastructure.
Inter-vendor portability, as I noted in the previous post, is important to the ultimate development of the cloud services market, but in the meantime, it is equally important to find ways to prevent such meltdowns, especially for services that rely on AWS for their core operations.
At first it appeared that, within the same affected “Regions”, some customers were luckier than others and disaster struck at random. However, in the wake of the outages, many studies have emerged that look at what went wrong and at how some companies, such as Netflix, Twilio, Mashery and Instructure, to name a few, were able to overcome the troubles. The postmortems offered by those companies are great reads for anyone interested in building scalable, reliable and high-performing web systems. I will cite just a few here as a starting point:
– Charles Babcock’s article in InformationWeek has great details on how Bizo quickly reacted to the issue, and how Mashery’s failover plan anticipated the problem and dealt with it smoothly.
– Stephanie Overby’s article in ComputerWorld highlights 7 takeaways from the experience of Mashery and Bizo, among others.
– Eric Kidd has a detailed timeline of the outages and highlights many lessons as well.
– Clay Loveless’s “Failure is not an option” places the reliability issue within a more comprehensive framework of best practices in scalable system design.