Lessons from the Amazon cloud computing outage

On April 26, 2011, in cloud computing, by Jason Leeson

By now, we all know that Amazon’s cloud computing service suffered an outage last week affecting many of their North American customers. Now that things have returned to normal, this is a great opportunity to step back and look at some of the things many customers have overlooked while jumping on the cloud computing wave sweeping the hosting and IT markets.

First, let’s look at what hasn’t changed:

  1. Cloud hosting service availability: For the most part, major cloud hosting providers still deliver a much higher level of availability than most businesses could achieve by running everything in-house.
  2. Cloud computing cost effectiveness: Buying servers and storage and paying the labour costs to run them in-house is much more expensive than using a cloud provider for your infrastructure needs. Factor in the flexibility to pay only for the computing power you need at any one time and it’s not even close. In his latest post, Ray Wang of Constellation Research calls it a factor of 10 cheaper.
  3. Outages will happen: Outages have always happened whether you run services in-house, use colocation providers, managed hosting providers or cloud providers. Most often they are very short. Sometimes they aren’t. The issue at hand is how your cloud provider handles the situation, communicates with you and helps you get running again.
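
To put "availability" in concrete terms, it helps to convert a percentage into the downtime it permits. The short Python sketch below does that arithmetic. The percentages are example inputs chosen for illustration, not figures quoted by Amazon or any other provider, and the function name is our own.

```python
# Illustrative only: the availability figures below are example inputs,
# not numbers published by any provider.
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the hours of downtime per year implied by an availability percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for pct in (99.0, 99.9, 99.99):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% availability allows about {hours:.2f} hours of downtime per year")
```

Under these example inputs, each extra "nine" cuts the permitted downtime by a factor of ten, from roughly 88 hours a year at 99% to under an hour at 99.99%. Whether that downtime arrives as many short blips or one long outage is exactly the question to put to your provider.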

That being said, as the number of mission-critical applications running on cloud services grows, cloud computing providers, especially the public cloud variety, need to start fulfilling their end of the deal when it comes to enabling their customers to deal with outages. Here are the things we believe a cloud provider should be expected to do:

  1. Be Transparent: Share the details with your customers on how you deliver your service. Do you use well-known vendors and tools or have you built your own proprietary hosting solution? How do you guard against outages? How do the storage services and compute services affect each other if one is not available? We wrote a blog post earlier this year about the importance of understanding the details behind a cloud provider’s offering. This thinking is gaining traction these days. Gartner analyst Lydia Leong just wrote a similar piece in the Amazon outage aftermath.
  2. Communicate Risk: Be open with your customers about risk. If they are running their services in one zone or one datacenter, make sure they understand the risk they are accepting by doing so.
  3. Communicate Disaster Recovery Options: Help your customers understand what options they have for implementing a disaster recovery solution. Do you provide a service to mirror their applications in a separate datacenter or zone? Is it up to them to figure it out on their own?
  4. Support: Support really matters when things go sideways. Many public cloud providers offer only online support options. They don’t offer the dedicated customer support reps that enterprises tend to need when they cannot solve the problem themselves.

On the other hand, as a cloud computing customer, what can you do when selecting your cloud provider?

  1. Understand the risks: Contrary to the advice of many pundits, you can’t treat your service provider like a black box. You need to understand the cloud provider’s architecture in order to plan recovery from significant outages.
  2. Understand the impact: Are you using a cloud provider to run non-mission-critical test and development systems, or are you using a cloud to run internal or external mission-critical applications? The impact of an outage on your business is much different in each case. Many of the high-profile companies affected by the outage used Amazon to run mission-critical services without a proper disaster recovery or contingency plan.
  3. Plan for a significant outage: Whether you run your services in-house or with a cloud provider, you need to do this. For Amazon customers, having instances ready to run in a different region that wasn’t affected, and a plan to switch between the two regions, may have worked. Radiant operates East and West coast cloud hosting centers and we help our customers create a recovery plan that will bring their services up in the secondary datacenter should the primary one be affected by a disaster (think earthquake). If you don’t have the in-house IT skills to set this up, find a cloud provider that provides additional services and will help you with your plan.
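
As a rough illustration of the "plan to switch between the two regions" idea, here is a minimal Python sketch of the decision logic only. Everything in it is an assumption made for the example: the class name, the region labels, the threshold of three consecutive failed checks, and the simulated health check. It does not call any real cloud provider API, and a real plan would also need to cover data replication, DNS changes and how to fail back.

```python
from typing import Callable


class RegionFailover:
    """Decide which region should serve traffic (hypothetical sketch).

    `check` is any function that returns True when the named region is healthy.
    """

    def __init__(self, primary: str, secondary: str,
                 check: Callable[[str], bool], failure_threshold: int = 3) -> None:
        self.primary = primary
        self.secondary = secondary
        self.check = check
        self.failure_threshold = failure_threshold  # assumed value, tune to taste
        self.active = primary
        self._failures = 0  # consecutive failed checks of the primary

    def poll(self) -> str:
        """Run one health check and return the region that should be active."""
        if self.active != self.primary:
            # Already failed over; failing back is left as a manual decision.
            return self.active
        if self.check(self.primary):
            self._failures = 0
        else:
            self._failures += 1
            # Switch only after repeated failures, and only if the
            # secondary region is itself healthy.
            if (self._failures >= self.failure_threshold
                    and self.check(self.secondary)):
                self.active = self.secondary
        return self.active


# Simulated outage: pretend the "east" region is down.
down = {"east"}
failover = RegionFailover("east", "west", check=lambda region: region not in down)
for _ in range(3):
    print(failover.poll())
```

Requiring several consecutive failures before switching is a deliberate choice in this sketch: most outages are very short, and a plan that flips regions on every blip can cause more disruption than the blip itself.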

Looks like it’s time to dust off the disaster recovery plan…
