At PlayFab, one of our core tenets is no downtime. This makes challenges like the recent overhaul of our AWS topology particularly interesting. When you're a backend-as-a-service, you can't just turn everything off for six hours. (Or six minutes.) This, then, is a postmortem of how we migrated from AWS EC2-Classic to AWS Virtual Private Cloud (VPC) without losing our data, our customers, or our minds.
Most postmortem blog posts are teary affairs explaining how something went wildly wrong and how this will not happen again, but we think postmortems are useful for any major change, whether a success or failure. It's great to learn from mistakes, but it's also useful to learn from successes that demonstrate due diligence to the only true shortcut: Do things right the first time.
It used to be that servers were a fixed resource: they were bought and racked, then replaced with more powerful ones on a schedule. Using cloud services changes this. Instead of estimating a workload and picking a server size for it, we can tune the workload and the server size together to find the best price/performance ratio. Tuning for price/performance is more complicated than tuning just for performance, but it's much more efficient. In order to pass those savings on to PlayFab's customers, we decided to start using some of the new AWS instance types: t2 and c4.
These instance types are only available in AWS VPC, and while that was fine for new systems, all of our existing systems and security groups had originally been set up in AWS EC2-Classic. Unfortunately, the two kinds of security group do not easily mix. Security groups (which define which IPs can access which ports, similar to basic iptables firewall rules) cannot be shared between EC2-Classic and EC2-VPC, and at PlayFab, we make extensive use of security group membership-based access. For example, our core API servers use a security group to access the port for our backend database servers.
The underlying mechanism here is that each member of a security group, when added to another security group, is enumerated into an expanded list of private IPs to allow access. For example, if the apiservers group contains instances i-123456 and i-abcdef, and the apiservers group is added to the database-access group with an allowed port of 1234, it's the same as adding the private IPs of i-123456 and i-abcdef directly to the database-access group with an allowed port of 1234. The list follows group membership rather than being fixed, however, so simply creating a new server instance that is a member of the apiservers group gives it access to the database.
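As a rough illustration of that expansion, here is a toy Python model. It is not the AWS API: the SecurityGroup class, its method names, and the private IP addresses are all made up for the example, and only the group names, instance IDs, and port come from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class SecurityGroup:
    """Toy stand-in for a security group (hypothetical, not the AWS API)."""
    name: str
    members: dict = field(default_factory=dict)   # instance id -> private IP
    ingress: list = field(default_factory=list)   # (source group, port) rules

    def allow_group(self, source, port):
        """Let every member of `source` reach `port` on this group."""
        self.ingress.append((source, port))

    def allowed_ips(self, port):
        """Expand group references into the private IPs allowed on `port`."""
        return {
            ip
            for source, rule_port in self.ingress
            if rule_port == port
            for ip in source.members.values()
        }


# Private IPs below are invented for illustration.
apiservers = SecurityGroup("apiservers", {"i-123456": "10.0.0.11", "i-abcdef": "10.0.0.12"})
database_access = SecurityGroup("database-access")
database_access.allow_group(apiservers, 1234)
before = sorted(database_access.allowed_ips(1234))

# A new instance joins apiservers; no change to database-access is needed.
apiservers.members["i-777777"] = "10.0.0.13"
after = sorted(database_access.allowed_ips(1234))
print(before, after)
```

Because the rule stores a reference to the group and expands it on each evaluation, the new instance gains access without anyone touching the database-access rules. That convenience is what stops working across the EC2-Classic/VPC boundary.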
To complicate matters, we were moving from a script-based deployment methodology that used the ebextensions functionality in AWS Elastic Beanstalk to a new one that uses the SaltStack automation tool.
Step 1: The shadow environment
PlayFab's architecture is set up such that each endpoint server is effectively stateless. This meant we could stand up a second environment and test how connectivity to databases and other resources worked with absolutely no customer traffic hitting these endpoints.
The new deployment mechanism uses AWS Elastic Beanstalk to provide automated infrastructure setup and SaltStack to manage and configure the rest of the machine. All customer-facing machines run Windows Server, and we have pre-built AMIs containing a fully patched, ready-to-go copy of Windows, pre-loaded with all the tools needed to set up a server.
This meant that we could stand up new Elastic Beanstalk environments side-by-side with the current running environments, using the exact same binary packages to install the API endpoints. We could then wait for the deployment mechanism to pull down the configuration and install monitoring tools, utilities, and the new service discovery agent (Consul) that was being deployed as part of our new automation package.
Previously, PlayFab had been using Route 53 as a sort of pseudo-service-discovery provider, either by aliasing Elastic Load Balancers to domain names, or by assigning Elastic IPs to instances and then mapping those IPs to names. Consul turns all that on its head by providing a DNS server that intelligently returns the IPs of registered services based on health checks. If an instance fails its health check, it's simply not returned by Consul as an endpoint.
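To make the health-check filtering concrete, here is a small in-memory sketch in Python. It only imitates the behavior described above and does not talk to Consul; the ServiceRegistry class, the service name, and the IP addresses are hypothetical.

```python
class ServiceRegistry:
    """Toy model of health-aware service discovery (not the Consul API)."""

    def __init__(self):
        self._instances = {}  # service name -> {ip: is_healthy}

    def register(self, service, ip, healthy=True):
        self._instances.setdefault(service, {})[ip] = healthy

    def set_health(self, service, ip, healthy):
        """Record the latest health check result for one instance."""
        self._instances[service][ip] = healthy

    def resolve(self, service):
        """Return only the IPs whose last health check passed."""
        instances = self._instances.get(service, {})
        return sorted(ip for ip, healthy in instances.items() if healthy)


registry = ServiceRegistry()
registry.register("api", "10.0.1.10")
registry.register("api", "10.0.1.11")
all_healthy = registry.resolve("api")

# One instance fails its health check and drops out of the answers.
registry.set_health("api", "10.0.1.11", False)
after_failure = registry.resolve("api")
print(all_healthy, after_failure)
```

The point of the model is that callers never need to know an instance went bad: they keep asking for the service by name and simply stop receiving the failed address.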
To get around the loss of EC2-Classic's nested security group access (and since moving our Amazon RDS databases would take both downtime and a lot of effort), we created proxy servers that use Elastic IPs to maintain static addresses, then added those static IPs to the database security groups. Going through a proxy does constrain our bandwidth to the database, but adding more proxy servers increases the pass-through bandwidth, and because we use AWS DynamoDB for much of our data storage, the traffic to SQL databases is minimal anyway. We also deployed multiple proxy servers to provide highly available access. This had the happy side effect of allowing us to provide a few static IPs to several third-party services that needed to use IP-based whitelists, actually reducing our overall use of Elastic IPs.
Some of our services are singletons, in that only one can be running at a time; if more than one is running, they'll step all over each other. We chose to use Consul's leader election mechanism to provide both high availability and single-server exclusivity for these services. Consul provides a key/value store that allows locks to be taken on specific keys and held until a specific condition occurs, such as a health check failing. This means that the leader of an active/passive pair can lock the key, and the passive member of the pair can just watch the key, so when it unlocks, it can take over. This is fully integrated into the service discovery framework as well, so we can have instant fail-over of critical singleton services.
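The active/passive pattern can be sketched with a toy lock store in Python. This is an in-memory imitation under simplifying assumptions, not Consul's API: the LockStore class, the key name, and the session names are hypothetical, and a real standby would watch the key rather than retry by hand.

```python
class LockStore:
    """Toy key/value lock store for leader election (not the Consul API)."""

    def __init__(self):
        self._holders = {}  # key -> session currently holding the lock

    def acquire(self, key, session):
        """Try to take the lock; True if `session` holds it afterwards."""
        return self._holders.setdefault(key, session) == session

    def invalidate(self, session):
        """Release everything a session holds, e.g. after a failed health check."""
        for key in [k for k, s in self._holders.items() if s == session]:
            del self._holders[key]

    def leader(self, key):
        return self._holders.get(key)


store = LockStore()
key = "service/singleton/leader"  # hypothetical key name

active_won = store.acquire(key, "node-a")    # node-a becomes the leader
passive_won = store.acquire(key, "node-b")   # node-b stays passive

# node-a fails its health check, its lock is released, node-b takes over.
store.invalidate("node-a")
takeover_won = store.acquire(key, "node-b")
print(active_won, passive_won, takeover_won, store.leader(key))
```

Tying the lock to the holder's health is what gives both properties at once: only one node can hold the key, and a dead leader cannot hold it forever.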
We found a few issues in the shadow environment, which was great, since that meant customers wouldn't run into them during normal use. The biggest was a misconfigured security group on one of our SQL databases that simply wouldn't let us contact it. A good suite of integration tests is amazing here, especially if they can be pointed directly at your production environment (or, in this case, the shadow of production).
The other major issue we ran into was that AWS Elastic Beanstalk did not yet support c4 instance types, but that wasn't a big deal, since Elastic Beanstalk is simply an AWS CloudFormation template under the hood. Since the ebextensions files modify that CloudFormation template directly, we just added the c4 instance types to the allowed instance mapping and off we went.
Step 2: Cutover
The same architectural decisions that made it easy to stand up a shadow environment made it easy to run a red/black deployment test with live production traffic. We set up a new DNS record in Route 53 and used weighted DNS records to give the current production environment all of the traffic at first, then slowly shifted more and more of it to the new environment while monitoring constantly for errors.
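A weighted cutover like this can be modeled in a few lines of Python. The sketch assumes two weighted records where each receives traffic in proportion to its weight; the percentages in the schedule, the function names, and the fixed random seed are illustrative choices, not the values we used.

```python
import random


def ramp_schedule(new_shares):
    """Turn target percentages for the new environment into
    (old_weight, new_weight) pairs for two weighted records."""
    for share in new_shares:
        if not 0 <= share <= 100:
            raise ValueError(f"share must be between 0 and 100, got {share}")
        yield (100 - share, share)


def resolve(old_weight, new_weight, rng):
    """Pick an environment in proportion to the record weights."""
    return rng.choices(["old", "new"], weights=[old_weight, new_weight])[0]


rng = random.Random(42)  # seeded only to make the simulation repeatable
schedule = list(ramp_schedule([0, 10, 50, 100]))  # illustrative steps

for old_weight, new_weight in schedule:
    hits = sum(resolve(old_weight, new_weight, rng) == "new" for _ in range(10_000))
    print(f"weights {old_weight}/{new_weight}: {hits / 100:.1f}% of lookups hit the new environment")
```

Each step is also a rollback point: if errors show up at a small share, setting the new record's weight back to zero returns all traffic to the old environment.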
We did see a few login failures related to PlayStation Network whitelists not being complete, so we had to disable one of the proxy instances to remove its IP from the proxy server pool until PSN could fix the issue.
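A check along these lines would have flagged the gap before cutover. It is a minimal sketch using Python's standard ipaddress module; the function name is hypothetical, the addresses come from documentation-only ranges, and it assumes the third party can tell you what is actually on its whitelist.

```python
import ipaddress


def missing_from_whitelist(egress_ips, whitelist):
    """Return every egress IP not covered by any whitelist entry.

    Whitelist entries may be single addresses or CIDR ranges.
    """
    networks = [ipaddress.ip_network(entry) for entry in whitelist]
    return [
        ip for ip in egress_ips
        if not any(ipaddress.ip_address(ip) in network for network in networks)
    ]


# Hypothetical proxy addresses and whitelist contents.
proxy_ips = ["203.0.113.10", "203.0.113.11", "198.51.100.7"]
partner_whitelist = ["203.0.113.10", "198.51.100.0/24"]

missing = missing_from_whitelist(proxy_ips, partner_whitelist)
print(missing)
```

The check covers every proxy address rather than a sample, so a single missing entry shows up as a non-empty result.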
There was also a cutover failure in our game developer administration site, which had a DNS cache problem. It had failed to look up the proxy server for one of the databases, and we discovered that Windows caches DNS negative responses (an authoritative response saying that a particular record does not exist) for 900 seconds (15 minutes) by default. This meant that since it hadn't discovered the proxy server on its first try, the DNS Cache service would return a negative response for all lookups for that name in the next 15 minutes. This was resolved by cutting back to the original administration site, then deploying an update with our shiny new SaltStack automation tool to set MaxNegativeCacheTtl to 0 before successfully cutting over to the new site a second time.
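The effect of negative caching is easy to reproduce in a toy resolver. The Python sketch below is a simplified model, not the Windows DNS Client service: the CachingResolver class, the host name, and the IP address are hypothetical, and time is passed in explicitly so the behavior is deterministic.

```python
class CachingResolver:
    """Toy resolver that caches "name does not exist" answers for a TTL."""

    def __init__(self, lookup, negative_ttl):
        self._lookup = lookup              # callable: name -> IP or None
        self._negative_ttl = negative_ttl  # seconds; 0 disables negative caching
        self._negative_until = {}          # name -> time the cached miss expires

    def resolve(self, name, now):
        if now < self._negative_until.get(name, float("-inf")):
            return None  # answered from the negative cache, no real lookup
        ip = self._lookup(name)
        if ip is None and self._negative_ttl > 0:
            self._negative_until[name] = now + self._negative_ttl
        return ip


records = {}
name = "db-proxy.example.internal"  # hypothetical name

default_like = CachingResolver(records.get, negative_ttl=900)
no_negative_cache = CachingResolver(records.get, negative_ttl=0)

# First lookup happens before the record exists, so both resolvers miss.
first = (default_like.resolve(name, now=0), no_negative_cache.resolve(name, now=0))

records[name] = "10.0.2.20"  # the proxy registers shortly afterwards

one_minute_later = (default_like.resolve(name, now=60), no_negative_cache.resolve(name, now=60))
after_ttl = default_like.resolve(name, now=900)
print(first, one_minute_later, after_ttl)
```

With a 900-second negative TTL the first resolver keeps answering "not found" for a record that has existed for almost the whole window, which matches what the administration site saw. With the TTL at 0, the next lookup succeeds.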
The rest of the services, including the singleton services, were cut over without incident.
Step 3: Post-deployment Monitoring
The new servers have been live since March 2nd, and we've seen a nice reduction in cost due to using t2 instance types for some of the spikier workloads or less-used services and a slight reduction in both latency and CPU use due to the SR-IOV Enhanced Networking and the newer "Haswell" CPUs in c4 instance types.
An EC2-Classic to VPC migration of a live service is scary, since it's a huge architectural change with a lot of moving parts. Good planning, testing, and execution meant that the whole thing went smoothly overall, with no issues that would have caused gamers or developers to experience errors or be unable to use the service.
A couple of lessons learned for next time:
- Windows caches negative DNS responses.
- When relying on an IP whitelist, confirm yourself that all IPs are whitelisted, not just a sample.
And by far the biggest lesson is one we really already knew: The ability to slowly cut traffic over is amazingly useful.