by CJ Williams 2018-09-07

PlayFab Site Reliability

At PlayFab we have a motto, "Ad facilia per aspera." It means that we make the hard stuff easy. One of the hardest things to make easy is a highly scalable, reliable service, but PlayFab is committed first and foremost to providing you with a service that is reliable at all times. We know that any incident affects your game and, most importantly, your gamers. We have had a good track record of service reliability, but June and July were two months with a higher than normal number of incidents. While not all of these incidents affected every game or even every player in every game, we want to be transparent with you about how we handle incident response and what work we do to prevent incidents from happening again. For those of you whose games were affected, please accept our apologies and read on to see how we will improve our incident process going forward.

Given the higher than normal number of incidents, we have taken a step back over the last month to understand how we can prevent similar incidents in the future. As part of this review, we are making some changes to our incident response process.

Here are some of the incident response processes that we are improving:

  • For every major or minor incident, we will post a full post-mortem. Our goal is to publish it soon after the incident occurs, but we will always complete a thorough root cause analysis before posting. We are posting all the post-mortems for June, July, and August immediately. You can go over to our PlayFab status page to read them now.
  • We have separated communicating about an incident from mitigating or fixing it. When PlayFab was a start-up, the on-call engineer handled both responsibilities. As part of this change, we will communicate every incident on both the PlayFab status page and our Slack channel.
  • We will post updates every 15 minutes until the incident is mitigated. While we don’t ever want an extended incident, we are committed to ensuring you always know the latest status.

Thank you for joining us on this journey to be the best and most reliable service for Full Stack LiveOps and Real-time Control.  Ad facilia per aspera.

As always, we love to hear your feedback! So please, if you have questions or concerns about these changes, let us know on our forums.