Yesterday, we were Slashdotted. We had turned off caching the night before to test something with our sidebar and didn’t figure /. would pick up Edgar’s piece, so we left it off. Naturally, the traffic crush hit, and HTMList became unavailable around noon PDT.

I’ll go into more detail on what we did to bring things back up and keep the site available in another post, but suffice it to say, we moved our background image, subscribe/RSS feed image and Synapse Studios logo over to S3 to help take some of the load off. And of course, S3 went down today.

As of right this moment, here’s where Amazon’s status page is at:

9:05 AM PDT We are currently experiencing elevated error rates with S3. We are investigating.
9:26 AM PDT We’re investigating an issue affecting requests. We’ll continue to post updates here.
9:48 AM PDT Just wanted to provide an update that we are currently pursuing several paths of corrective action.
10:12 AM PDT We are continuing to pursue corrective action.
10:32 AM PDT A quick update that we believe this is an issue with the communication between several Amazon S3 internal components. We do not have an ETA at this time but will continue to keep you updated.
11:01 AM PDT We’re currently in the process of testing a potential solution.
11:22 AM PDT Testing is still in progress. We’re working very hard to restore service to our customers.
11:45 AM PDT We are still in the process of testing a series of configuration changes aimed at bringing the service back online.

Moving data into the cloud and relying on the cloud is a fantastic concept for a lot of reasons. Redundancy, scalability and pay-as-you-need pricing are a tempting combination. But when the system still has single points of failure, or chokepoints that can cascade and bring the entire service down, we need to be cautious with what’s hosted there.

Now, we’re a relatively small blog. It’s not a huge deal that our background image isn’t displaying. But photo host SmugMug’s entire service relies on S3 hosting to store your photos. When S3 goes down, their entire business model fails. And it’s difficult to explain to your clients that your hosting provider is down without looking cheap or like you don’t know how to build for these situations or that you’re not able to handle playing at this level. Even though you’re not, and you do and you are.

Your customers don’t really care about a distributed, cloud-based storage model that allows you to ensure data integrity and availability 99.9% of the time. They see that 0.1% and start to wonder. Because people take availability for granted. When you’re not a web developer or database administrator or scalability engineer, you don’t have to apply the cycles to wonder exactly how, say, Google delivers its results billions of times a day, in less than a second, without you paying a single cent.
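
To put that 0.1% in perspective, here’s a quick back-of-the-envelope sketch in Python. The 99.9% figure is just the illustrative number from the paragraph above, not a quote from any provider’s SLA, and the function name and the 365-day year are our own assumptions.

```python
# Rough downtime budget for a given availability percentage.
# Assumption: a 365-day year (8,760 hours), ignoring leap years.
HOURS_PER_YEAR = 365 * 24


def downtime_hours(availability_percent: float,
                   period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the hours of downtime allowed in a period at this availability."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return period_hours * unavailable_fraction


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99):
        hours = downtime_hours(pct)
        print(f"{pct}% available -> {hours:.2f} hours down per year")
```

At 99.9%, that works out to about 8.76 hours a year. The status updates above already span 2 hours and 40 minutes, so under that assumption a single morning like this one eats roughly 30% of a year’s budget.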

But the truth is, it’s exceptionally challenging work that relies on hardware components that can fail, software optimizations that can be brought down by edge cases or missed opportunities, and humans at every step who are learning as they go.

So for now, we keep refreshing the status page and know that Amazon is more incentivized than anyone to get their services back up and running. They do, after all, eat their own dog food and use S3 to host their product images.
