We here at PlayFab are very proud of our uptime record, which had been 99.995% for more than 2 years... until a power outage at an Amazon Web Services (AWS) data center brought us down for two hours on Memorial Day (May 26), 2014.
While we couldn’t have anticipated this particular outage, we could have done more to prevent one from taking down our whole service. A calculated risk that was acceptable back when we were the internal service for Uber Entertainment is no longer acceptable in our new incarnation as PlayFab. Now that reliability is a big part of what we’re selling, we need to reduce our exposure to downtime risk.
This blog post describes what we’ve done to increase redundancy and reduce the chances that a repeat AWS outage would take us down again.
To understand what happened, you need to know how Amazon has architected AWS, which provides the underlying servers and services that PlayFab runs on. AWS is first divided into regions, which are geographically dispersed and completely independent from one another (e.g., California vs. Oregon). Each region is further subdivided into multiple availability zones, which are physically isolated from each other and do not share common points of failure like generators or cooling equipment. It’s described here if you want to read more about it. Amazon has also documented their own recommended best practices regarding fault-tolerance.
On May 26, around 5:45 PM PST, a single availability zone in AWS region US-WEST-1 experienced a power failure for almost 2 hours. Because we were running several core components of our service in that zone, this outage effectively took out PlayFab for the duration.
We had known that proper fault tolerance requires running in multiple zones, if not multiple regions, but because AWS has generally been quite stable, we had previously accepted the risk of being in a single zone in return for reducing our cost and complexity.
Complexity, because at that point, we had mostly been relying on manual provisioning of new servers. Since setting up a new server isn’t much work, just 10-20 minutes, it had been easier to just keep manually adding servers. The lack of automation, however, meant that setting up a completely new, redundant version of PlayFab in a new region was a much bigger project.
Cost, because it also would have been expensive -- maintaining a fully redundant version of PlayFab, able to take over 100% of the load, would have required 2x the number of servers.
This outage forced us to realize that as a separate company, where our service is everything, we could no longer risk running in a single zone. We would have to split into at least 2 zones. Once we decided to address this issue once and for all, the actual implementation was straightforward -- and, beyond greater reliability, it has yielded a number of additional benefits, outlined below.
The most important thing we did was to invest in proper automation, helped in no small part by the recent hiring of our first full-time Senior DevOps Engineer. We decided to use Amazon’s Elastic Beanstalk, their free service for automatically deploying and scaling web applications. With Elastic Beanstalk, we package our services in a specific format that AWS knows how to deploy. Elastic Beanstalk sets up monitoring and auto-scaling, meaning that the system will automatically add or subtract servers as needed.
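As a rough sketch of what this looks like in practice (the file name and values here are illustrative, not our production settings), an Elastic Beanstalk environment can be told how to auto-scale via an `.ebextensions` config file bundled with the application:

```yaml
# .ebextensions/scaling.config -- illustrative values only
option_settings:
  aws:autoscaling:asg:
    MinSize: 4          # never drop below baseline capacity
    MaxSize: 16         # cap how far a traffic spike can scale us
  aws:autoscaling:trigger:
    MeasureName: CPUUtilization
    Unit: Percent
    UpperThreshold: 70  # add servers when average CPU exceeds this
    LowerThreshold: 30  # remove servers when load subsides
```

Once a file like this is part of the application bundle, Beanstalk applies the settings on every deploy, so scaling behavior is versioned alongside the code rather than configured by hand.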
Once Beanstalk was in place, setting up a redundant implementation of PlayFab in a second availability zone was quick and easy -- and importantly, because Beanstalk can quickly spin up new servers as needed, it didn’t incur an additional cost hit.
Say we need 10 API servers to handle our standard load. Now we can have 5 servers running in each zone and Amazon’s Elastic Load Balancing (ELB) will route traffic evenly across all of them. If one availability zone suddenly fails, and half the servers go offline, then Elastic Beanstalk will automatically spin up new servers in the surviving zone. Within minutes we’re back at full capacity.
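The multi-zone layout described above can be expressed in the same `.ebextensions` style (again, the specific values are illustrative): the Auto Scaling group is told to spread instances across two Availability Zones, and the load balancer is told to balance across zones rather than only within its own:

```yaml
# .ebextensions/multi-az.config -- illustrative values only
option_settings:
  aws:autoscaling:asg:
    Availability Zones: Any 2  # spread instances across two AZs
    MinSize: 10                # baseline capacity, split between zones
  aws:elb:loadbalancer:
    CrossZone: true            # route traffic evenly across both zones
```

With this in place, a zone failure looks to the Auto Scaling group like unhealthy instances, and replacements come up automatically in the surviving zone.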
The final thing we did was invest in a more robust outage notification system. Our monitoring itself was already thorough, using a combination of Pingdom, New Relic, and Amazon’s CloudWatch to watch our server operations, but it relied mostly on email alerts to tell us when something went wrong. An email, however, lacks urgency, and it’s not clear who on the team is handling it.
We have now consolidated our alerts behind the excellent VictorOps service -- so now we have much better notification and tracking in the event things do go wrong. Furthermore, we have implemented custom app-specific monitors using CloudWatch so our ops team can also monitor KPIs like “# of connected players” or “server errors per second”.
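To make the “server errors per second” idea concrete, here is a minimal sketch (our own illustration, not PlayFab’s actual monitoring code) of computing such a KPI over a sliding window; the windowed value is what a custom CloudWatch metric would be fed with:

```python
from collections import deque

class ErrorRateMonitor:
    """Tracks server errors per second over a sliding time window.

    The computed rate is the kind of app-specific KPI that can be
    published to CloudWatch as a custom metric and alarmed on.
    """

    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.events = deque()  # timestamps of errors, in arrival order

    def record_error(self, timestamp):
        """Record one server error at the given epoch timestamp."""
        self.events.append(timestamp)

    def errors_per_second(self, now):
        """Return the average error rate over the trailing window."""
        # Drop events that have aged out of the window.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        return len(self.events) / self.window
```

In production, the windowed value would presumably be pushed to CloudWatch on a timer via a PutMetricData call under a custom namespace, with CloudWatch alarms on that metric feeding the notification service.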
Other Benefits of Automation and Redundancy
With these fixes in place, we have greatly reduced the chances that a similar AWS power failure in a single zone will bring us down. The investment in proper automation, however, has also yielded a number of additional benefits.
The biggest benefit was the cost savings made possible by running our current servers “hotter” than we had previously been able to run them. Thanks to Beanstalk, we’ve been able to cut back on the total number of servers, but still be confident that a sudden spike of load (such as an editor’s choice promotion, or newly released game patch) will be handled -- and then automatically reduced once the surge has passed. This is especially critical for game servers since their load patterns are so “spiky.”
We can also now configure entirely new server clusters much more easily. This allows us to offer enterprise customers their own, independent server clusters, isolated from other games and other developers. Previously, the work involved in setting up and maintaining independent clusters would have made this prohibitive for all but our biggest customers. Now, this is a feature we can offer to even much smaller developers or publishers who care about isolation.
It also makes it easier for us to set up additional environments for testing or demos, and lets us consider adding a third “staging” environment to our existing “sandbox” and “live” environments.
The fixes described in this post are just the first steps toward increasing our fault tolerance. In the works is yet another layer of redundancy -- setting up redundant regions, in case an entire region goes down.
We will probably never know exactly what caused the power outage in AWS. But given that it led us to set up a far more resilient and efficient system, and because the impact to our customers was minimal (due to the date, the time, and the fact that none of our customers was in the middle of a critical launch), we’re ultimately grateful this happened when it happened -- and hope sharing the details of our experience is helpful to others.