Amazon Explains S3 Outage: Gossip Kills
By Chris Cardinal
On July 27th, 2008
Amazon has released a rather comprehensive write-up on their post-mortem analysis of why Amazon S3 went down last week. The S3 servers use a gossiping protocol to determine system states, including what servers are available and the status of the nodes across the network.
A single bit was corrupted in several of these gossips, such that they were still intelligible but reflected inaccurate data about the system state. The corrupted messages propagated through the network (much like a virus, really) and caused most of the servers to spend most of their time gossiping or failing to complete the gossip; if the gossip doesn’t complete, the server can’t/won’t send its data.
While Amazon uses MD5 checksums on data in containers to ensure its integrity as it’s being transmitted, they weren’t doing this on their gossips. They’ve since established several new practices to help ensure that a problem like this won’t cause a failure across the entire system, including better failure handling with gossips and faster restoration when nodes do go down.
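To make the missing safeguard concrete, here is a minimal sketch in Python of what checksumming a gossip message could look like. This is purely illustrative and not Amazon's implementation: the message format (a JSON payload with a 16-byte MD5 digest in front) and the function names `pack_gossip`, `unpack_gossip`, and `flip_bit` are all my own assumptions.

```python
import hashlib
import json


def pack_gossip(state: dict) -> bytes:
    """Serialize a (hypothetical) state message and prepend its 16-byte MD5 digest."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).digest() + payload


def unpack_gossip(message: bytes) -> dict:
    """Verify the digest before trusting the payload; raise ValueError on mismatch."""
    digest, payload = message[:16], message[16:]
    if hashlib.md5(payload).digest() != digest:
        raise ValueError("gossip message failed checksum; discarding")
    return json.loads(payload.decode("utf-8"))


def flip_bit(message: bytes, bit_index: int) -> bytes:
    """Simulate corruption in transit by flipping a single bit."""
    corrupted = bytearray(message)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    message = pack_gossip({"node": "server-42", "available": True})
    print(unpack_gossip(message))  # the intact message is accepted

    damaged = flip_bit(message, bit_index=16 * 8 + 3)  # one bit inside the payload
    try:
        unpack_gossip(damaged)
    except ValueError as err:
        print("rejected:", err)
```

The point is that the receiver rejects a damaged message outright, so a flipped bit never gets the chance to pass as plausible-looking state and spread to other nodes.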
They end their missive simply enough, owning up in a way I give them credit for:
Though we’re proud of our operational performance in operating Amazon S3 for almost 2.5 years, we know that any downtime is unacceptable and we won’t be satisfied until performance is statistically indistinguishable from perfect.
“Statistically indistinguishable from perfect” is a rather poetic phrase, and I’d like to think we strive for that over at Synapse Studios. But my stats-masters programmer would just mock me.
Read their full statement here.
Tagged with: amazon, amazon aws, amazon outage, amazon s3, failboat, gossip, post-mortem, s3, statistics
Posted in: Tech News