Every API we add to the PlayFab service gets tested over and over again: at the unit level, at the integration level, and in production. We’ve now gone one step further and are using those same integration tests to monitor the status of all our services. This gives customers more detail about our uptime and lets us quickly pinpoint issues with any of our third-party integrations, such as Facebook. And because monitoring is tied to our integration tests, we don’t need to worry about it lagging behind production code.
It starts with wanting to be as efficient as possible (and not just because efficient engineers have higher scores in AdVenture Capitalist). We like to write clean code, test that code, and then deploy that code to our customers with a minimum of fuss. What we don’t like doing is writing the same tests for that nice, clean code over and over again — once for our unit tests, again for integration tests, and then once for every different monitoring system we use. Not only does it cost more on the front-end of development, but there’s a maintenance cost to pay that grows with every new API we add into the system.
Transparency is also very important to us — especially when it comes to uptime of our APIs. We want all of our clients to feel confident in knowing that our services are up and running, and confident that we’ll expose any outages. We don’t want our customers to waste time investigating an outage if it’s an issue on our end, or if we can readily trace it to one of our third-party integrations.
While we have used third-party services like Pingdom, we decided that since we’d written all of these wonderful tests for our API, we ought to put them to work overtime as monitors too. The results of that monitoring are now available on our Service Health page. We’ve started with just a few APIs and two regions, but soon we’ll report the uptime status of all key functionality, as well as our third-party integrations, from regions all over the world.
The most critical element in the pipeline is, of course, a reliable and repeatable set of tests that verify our most important pieces of functionality. At PlayFab we test on several different levels: at the unit level, at the interface level, and finally at the API level. Each level covers unique but equally important chunks of functionality, but it’s the top level, the API, that gives us the most flexibility. We were already using these tests to validate both local developer builds and release candidates, so it was only natural to extend them to validating running production code as well.
One Easy-to-Read Monitoring Report
In eras past, we might have stopped there. After all, having a set of tests that can be run against a production environment is pretty valuable. But nowadays, thanks to the cloud, it costs almost nothing to have low-powered virtual machines running all over the world. With a little technical magic, we installed our tests on those virtual machines, wired them up to chatter together, and then collated all of that data into one easy-to-digest report for both our own DevOps team and our customers.
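To make the collation step concrete, here is a minimal sketch of how results from several regions might be rolled up into one status per check. It is an illustration, not our production code: the ProbeResult type, the collate function, the region labels, the check names, and the up/degraded/down rule are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    region: str   # hypothetical region label, e.g. "us-west"
    check: str    # hypothetical name of the test case that ran
    passed: bool


def collate(results):
    """Summarize each check's status across all reporting regions."""
    by_check = defaultdict(list)
    for result in results:
        by_check[result.check].append(result)

    report = {}
    for check, rows in by_check.items():
        failing = sorted(r.region for r in rows if not r.passed)
        if not failing:
            status = "up"
        elif len(failing) == len(rows):
            status = "down"
        else:
            # Assumed rule: some regions failing, but not all of them.
            status = "degraded"
        report[check] = {"status": status, "failing_regions": failing}
    return report


results = [
    ProbeResult("us-west", "facebook_login", True),
    ProbeResult("eu-west", "facebook_login", False),
    ProbeResult("us-west", "player_data", True),
    ProbeResult("eu-west", "player_data", True),
]
report = collate(results)
for check, summary in sorted(report.items()):
    print(check, summary["status"], summary["failing_regions"])
```

Keeping the failing regions next to the status is one way a single report could serve both audiences: customers see whether a check is up, and whoever is on call sees where to start looking.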
The best part of the pipeline is that tests are never out of date or in need of updating. The exact same pieces of code are being run to pass a build as are being run to monitor our live services. This avoids the problem of monitoring falling behind production code. A developer will always update these tests for any changes because they have to — they won’t be able to check in their changes with failing test cases!
A great example of our end-to-end monitoring is when game developers use third-party authentication services such as Facebook. Internally, we have a number of unit tests and integration tests that validate all kinds of scenarios players might encounter when using Facebook authentication to log into a game. We have promoted our “success scenario” test case, in which a player with valid credentials attempts to log into a title using Facebook and succeeds, to be one of our monitoring cases. This means that the same test case will be run on a developer’s build when she is preparing a commit, on our continuous integration pipeline after she has submitted it to verify the build, and again against production.
If we happen to make a change to the API that governs Facebook login, then the developer will only have to update one place to ensure that everything continues to work happily. Both PlayFab developers and game developers building on our platform will be able to check the same place to ensure that Facebook login is working end-to-end: they can be assured not only that PlayFab is working, but also that our connection to Facebook is solid.
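The idea of promoting one test case to a monitoring case can be sketched in a few lines. This is a hedged illustration rather than our actual harness: the monitoring_case decorator, the FakeLoginClient stand-in, the token strings, and the target names are all hypothetical, and a real client would call the live API instead of comparing strings.

```python
MONITORING_CASES = []


def monitoring_case(check):
    """Mark an existing test case as one that also runs against production."""
    MONITORING_CASES.append(check)
    return check


class FakeLoginClient:
    """Stand-in for a real API client; accepts exactly one hard-coded token."""

    def __init__(self, target):
        self.target = target

    def login_with_facebook(self, access_token):
        return {"ok": access_token == "valid-token", "target": self.target}


@monitoring_case
def facebook_login_success(client):
    response = client.login_with_facebook("valid-token")
    assert response["ok"], f"Facebook login failed against {client.target}"


def facebook_login_rejects_bad_token(client):
    # Stays a build-time check only; it is not promoted to monitoring.
    assert not client.login_with_facebook("expired-token")["ok"]


def run(checks, target):
    """Run each check against one target and return {check name: passed}."""
    outcome = {}
    for check in checks:
        try:
            check(FakeLoginClient(target))
            outcome[check.__name__] = True
        except AssertionError:
            outcome[check.__name__] = False
    return outcome


all_checks = [facebook_login_success, facebook_login_rejects_bad_token]
local_results = run(all_checks, "local-build")
ci_results = run(all_checks, "ci")
production_results = run(MONITORING_CASES, "production")
```

Because the promoted check is the same function object in every run, a change to the login API only has to be reflected in one place, which is the property described above.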
We are carefully updating our most valuable internal API monitoring tests so that they are consumer-facing and meet our requirements for trustworthiness. Come back over the coming days and weeks to see the list of checks!
Obligatory pitch: If you found this post interesting, come and check out my in-depth tech talk at the STARWEST conference in Anaheim on October 1. Register using the promo code "SW15MB42" and save.