By Vincent Capo
Amazon has issued a detailed analysis of, and apology for, last week's
large-scale crash of its cloud service EC2, an event that brought down
dozens of websites and caused many businesses to lose e-commerce
revenue opportunities.
The disruption to Amazon Web Services' (AWS) Elastic Compute Cloud,
or EC2, limited customers' access to much of the information that was
stored in the company's East Coast regional data centers. Approximately
75 sites saw their web services degraded to varying degrees because of
the outage.
Until now, Amazon had stayed relatively silent about the cause. But
after completing a post-mortem assessment of the mess, the company
issued a detailed, 5,700-word explanation of what went wrong.
The AWS systems event was the first widespread outage EC2 has
suffered since it launched about five years ago. Amazon engineers
described the event as a technical perfect storm: an initial mistake
on their part set off a cascade of other bugs and glitches.
"As with any complicated operational issue, this one was caused by
several root causes interacting with one another," Amazon wrote.
On April 21, AWS tried to upgrade capacity in one storage section of
its regional network in Northern Virginia. That section is called an
"availability zone." There are multiple availability zones in each
region, with information spread across several zones in order to
protect against data loss or downtime.
The upgrade required some traffic to be rerouted. Instead of
redirecting the traffic within its primary network, Amazon accidentally
sent it to a backup network. That secondary network isn't designed to
handle a traffic flood of that size. It got overwhelmed and clogged up,
cutting a bunch of storage nodes off from the network.
When Amazon fixed the traffic flow, a failsafe triggered: The
storage volumes essentially freaked out and began searching for a place
to back up their data. That kicked off a "re-mirroring storm," filling
up all the available storage space. When storage volumes couldn't find
any way to back themselves up, they got "stuck." At the problem's
peak, about 13% of the availability zone's volumes were stuck.
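To make the mechanics of that storm concrete, here is a minimal toy
sketch in Python. It is not Amazon's code or algorithm; the function
name, its parameters, and the volume counts are all hypothetical,
chosen only to illustrate how volumes end up "stuck" once free storage
runs out.

```python
def simulate_remirror_storm(total_volumes, volumes_cut_off, free_slots):
    """Toy model of a re-mirroring storm in one availability zone.

    Every volume that lost contact with its replica asks for one free
    storage slot. Volumes that find no slot are counted as stuck.
    All inputs are made-up illustration values, not Amazon's figures.
    """
    remirrored = 0
    stuck = 0
    for _ in range(volumes_cut_off):
        if free_slots > 0:
            free_slots -= 1   # this volume found room for a new replica
            remirrored += 1
        else:
            stuck += 1        # no free space left, so the volume is stuck
    return {
        "remirrored": remirrored,
        "stuck": stuck,
        "stuck_share": stuck / total_volumes,
    }


if __name__ == "__main__":
    # Hypothetical zone: 1,000 volumes, 200 cut off, room for only 70 new replicas.
    print(simulate_remirror_storm(total_volumes=1000, volumes_cut_off=200, free_slots=70))
```

With these invented inputs, 130 of the 1,000 volumes find no space,
a share picked to echo the 13% figure above. The real system is far
more involved, but the basic squeeze is the same: more volumes looking
for a new home than there is room to house them.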
But why did a problem in one availability zone ripple out to affect a
whole region? That's precisely the kind of glitch Amazon's
infrastructure is supposed to prevent.
The analysis after the event brought to light that EC2 had a few
bugs. Amazon describes them in detail in its analysis, but the gist is
that the master system that coordinates all communication within the
region had design flaws. It was overwhelmed, suffered a "brown out,"
and turned an isolated problem into a widespread one.
Interestingly, those bugs and design flaws have always been in place,
but they wouldn't have been discovered if Amazon hadn't goofed up and
set off a domino chain.
Amazon says that knowing about and repairing those weaknesses will
make EC2 even stronger. The company has already made several fixes and
adjustments, and plans to deploy additional ones over the next few
weeks. The mistake presented "many opportunities to protect the service
against any similar event reoccurring," Amazon said.
Of course, Amazon's customers are less than happy to have been
guinea pigs in this cloud crash learning experience. Amazon offered a
mea culpa and said it would give all customers in the affected
availability zone a credit for 10 days of free service. Big friggin'
deal, if you ask me!
"We want to apologize," the company said in a prepared statement.
"We know how critical our services are to our customers' businesses and
we will do everything we can to learn from this event and use it to
drive improvement across our services."
This is the first of many cloud-related disruptions we can expect as
the cloud goes through growing pains, and as redundancy, latent demand,
capacity management, and systems security are continually improved
through closed-loop feedback from outages such as the one just
suffered. It is fair to note that the AWS EC2 service is more robust
today than it was last month; that is the silver lining.
published by myITview.com