Saturday, June 30, 2012

Today's "I was wrong" - Netflix, Amazon, and availability

Earlier this morning, I wrote a post in my tymshft blog entitled "Amazon cloud failure – non-tech CAUSE, but tech REASON." This post, a follow-up to an earlier tymshft post entitled "The decidedly non-tech reasons for computer system failures," dug into the Friday night failure of several Amazon-hosted services, including Pinterest, Instagram, and Netflix. The immediate cause was the weather, but weather alone cannot bring down a system.

In this morning's tymshft post, I opined:

Does Netflix require five nines availability? Well, that depends upon its users. I’d be willing to bet that most Netflix users haven’t thought about five nines availability. In the pre-Netflix days, back when people used to congregate in buildings called “movie theaters,” we didn’t necessarily think about five nines availability either.

In addition, I don’t know if Amazon promised five nines to Netflix and the other companies. My bet is that they didn’t promise it. My guess is that Amazon offered a higher availability option to the companies, and the companies rejected the offer for price reasons, sticking with a lower availability level. What could go wrong?

Well, the blogger could go wrong, that's what could go wrong. My assumption that Netflix hadn't purchased high availability was incorrect, according to Data Center Knowledge:

The latest outage was unusual in that it affected Netflix, a marquee customer for Amazon Web Services that is known to spread its resources across multiple AWS availability zones, a strategy that allows cloud users to route around problems at a single data center. Netflix has remained online through past AWS outages affecting a single availability zone.

Forbes provided a little more detail, including a tweet from Netflix's Adrian Cockcroft:

What’s interesting is Netflix seems to have the multi-region redundancy built in, but ran into issues with Elastic Load Balancing, which is the portion of Amazon’s service that tells web page requests which servers to get them from, connecting user requests to functioning instances. Or at least that’s what this tweet from Adrian Cockcroft seems to imply:

@jakeludington @wh1t3rabbit lost instances in one zone, but lost ELB traffic routing to the zones that were working… gradually coming back

— adrian cockcroft (@adrianco) June 30

So it appears that Netflix took a number of steps to ensure high availability for its customers - but something still went wrong here. Netflix is presumably looking at the issue right now, and may or may not reveal what it finds (it may choose not to, because doing so could benefit Netflix's competitors).
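For readers who want the arithmetic behind "five nines," and a feel for why redundancy across zones can still be undone by one shared component, here is a rough back-of-the-envelope sketch in Python. The availability figures for a zone and for the load balancer are made-up numbers for illustration only, not anything published by Amazon or Netflix, and the function names are my own. The sketch also assumes the zones fail independently of one another, which a storm may well violate.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability):
    """Expected minutes of downtime per year at a given availability (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def parallel(availability, n):
    """Availability of n redundant zones, assuming independent failures.

    The group is up as long as at least one zone is up.
    """
    return 1.0 - (1.0 - availability) ** n


def series(*availabilities):
    """Availability of components that must ALL be up at the same time."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


# "Five nines" allows only about five minutes of downtime per year.
five_nines_downtime = downtime_minutes_per_year(0.99999)

# Hypothetical numbers, chosen only to illustrate the shape of the problem.
zone = 0.999            # assumed availability of one zone
load_balancer = 0.9995  # assumed availability of the shared routing layer

three_zones = parallel(zone, 3)
end_to_end = series(three_zones, load_balancer)

print(f"Five nines: {five_nines_downtime:.2f} minutes of downtime per year")
print(f"Three redundant zones: {three_zones:.9f}")
print(f"Zones plus shared load balancer: {end_to_end:.9f}")
```

Under these assumed numbers, the three zones together look far better than any single zone, but the end-to-end figure can never be better than the shared routing layer that sits in front of them.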

Regardless, my tymshft post had an erroneous assumption.