Cloud Panic – A Cautionary Tale

For almost an hour yesterday, I thought the AWS instance that hosts my company website was unrecoverable. Randy Bias suggested that virtual servers should be like cattle: when they get sick, you shoot them. My instance was sick, but I needed it to return to health so that I wouldn't have to spend significant time rebuilding the server.

The AWS team has done an amazing job engineering the platform; however, as with all systems, failures happen. While I was testing a PHP script, I lost connectivity to my website. Shortly after, my SSH session disconnected. The AWS management console showed that my instance was stuck at 100% CPU utilization.

I tried to stop the instance, but it got stuck in the stopping state. Apparently this is a common problem, as a question on EBS-backed instances getting stuck in the stopping state has made its way into the AWS EC2 FAQ. The company points to an issue with the underlying host when this occurs.

Getting my website back online should have been as simple as a few clicks. This is the cloud, right?

Unfortunately, this wasn’t going to be the case for me. Although I regularly backed up critical directories on the server, I hadn’t taken a snapshot in almost six months. Not good!

I wanted to try stopping the instance using the EC2 command line tools, but I had deleted the virtual machine on my laptop that had the tools installed. I quickly fired up a new instance from the Amazon Linux AMI because this AMI has the tools pre-installed.

While I was uploading my certificate and private key to the new instance, my original instance escaped the stopping state purgatory. I restarted it, and my site was restored.

I have advice for wannabe sysadmins like me who host their company websites on AWS.

  1. Do a gut check. If downtime for your site results in revenue loss, consider outsourcing to a professional.
  2. Install EC2 tools on a server that you can access in case of emergency. Set up the certificate and private key in advance, and periodically verify that they work.
  3. Take snapshots prior to major changes on the server and practice restoring your instance from a snapshot.
  4. Learn MySQL basics. You should know how to back up and restore your database from the command line.
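
For the last item, here is a minimal sketch of what a command-line backup and restore can look like, wrapped in Python so it is easy to schedule or rehearse. It relies on the standard `mysqldump` and `mysql` client programs being installed; the function names are my own, and the database and user names in the usage note are hypothetical placeholders, not anything from my actual setup.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def dump_command(database: str, user: str) -> list[str]:
    """Build the mysqldump command line for a logical backup of one database."""
    # --single-transaction takes a consistent snapshot of InnoDB tables
    # without locking them for the duration of the dump.
    # -p with no value makes the client prompt for the password,
    # so it never appears in the process list or shell history.
    return ["mysqldump", "--single-transaction", "-u", user, "-p", database]


def restore_command(database: str, user: str) -> list[str]:
    """Build the mysql command line that replays a dump into a database."""
    return ["mysql", "-u", user, "-p", database]


def backup(database: str, user: str, out_file: Path) -> None:
    """Write a SQL dump of `database` to `out_file`."""
    with out_file.open("wb") as out:
        subprocess.run(dump_command(database, user), stdout=out, check=True)


def restore(database: str, user: str, dump_file: Path) -> None:
    """Load `dump_file` into `database`, which must already exist."""
    with dump_file.open("rb") as src:
        subprocess.run(restore_command(database, user), stdin=src, check=True)
```

Calling `backup("site_db", "backup_user", Path("site_db.sql"))` would be the equivalent of typing `mysqldump --single-transaction -u backup_user -p site_db > site_db.sql`, and `restore(...)` the equivalent of `mysql -u backup_user -p site_db < site_db.sql`. Whichever way you run it, rehearse the restore against a scratch database before you need it for real.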

I got lucky that my instance recovered. Take steps to ensure that you don’t have to rely on luck to save your behind.