One of the challenges of growing a Software-as-a-Service company is forecasting costs at scale. It's one thing to lose money when you're just starting out and dealing with unavoidable fixed costs. But get your costs wrong at scale, and that attractive pricing table can become a death sentence under the weight of negative margins.
At PlayFab, we've tackled the problem of forecasting costs by relying on a tool very familiar to us as backend engineers: Automated load testing. Using the same automated load testing framework we use to define maximum total capacity, along with the AWS detailed billing report, we can create a microscopic view of our entire cost structure and give our executives the tools they need to set a viable pricing model.
The Basic Challenge: Calculating the Margin
At the core of any business is this simple question: What is your margin? When you subtract the costs of goods sold from your revenue, are you (or will you be) profitable? For PlayFab and similar SaaS companies, the key number in determining COGS is the cost per API at scale. While it's easy to calculate the average cost per call across all APIs, it's more difficult to determine the cost per call of a specific API or set of APIs. For example, if RegisterPlayFabUser uses more resources than UpdateUserData, it's important to be able to quantify that before adding a new feature that may result in heavier use.
It can also be difficult to calculate the "at scale" part of this equation, particularly for startups. Cost varies greatly with scale and a startup usually has not reached the scale where margin calculation is meaningful. The solution then is the same as the one used to estimate maximum capacity: simulate the scale with load tests.
Setting Up the Automated Load Tests
The load test is designed to answer one question: given a finite resource (for example, a single AWS c4.large instance for one hour), how many API calls can we serve? For simplicity and easy calculation, we use one AWS EC2 machine to generate load against another EC2 machine that serves requests.
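The generator side of that setup can be sketched as a pool of worker threads hammering a request function for a fixed duration. This is a minimal illustration, not PlayFab's actual framework; the stub `request_fn` stands in for the HTTP call the real load generator would make to the API server under test.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def generate_load(request_fn, threads=4, duration_s=1.0):
    """Run request_fn in a tight loop on `threads` workers until the
    deadline passes; return the total number of calls completed."""
    deadline = time.monotonic() + duration_s

    def worker():
        calls = 0
        while time.monotonic() < deadline:
            request_fn()  # in a real test: issue one API request
            calls += 1
        return calls

    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(worker) for _ in range(threads)]
        return sum(f.result() for f in futures)
```

Dividing the returned count by the test duration gives the throughput the single generator machine achieved against the single server machine.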
(PlayFab builds on top of Amazon Web Services, so some techniques explained here apply only to AWS, but the concept behind them should be relevant to all platforms.)
From the usage statistics we've collected, we pick APIs that are frequently called or relevant to a particular new feature we're investigating. To achieve the desired utilization rate of the resource, we next determine how much load to generate, i.e. how many load-generator threads pound against the API server. This requires a performance curve of each API against each bottleneck resource, which may be CPU, memory, I/O, etc., depending on the nature of the API. We have run standard stress tests to understand the performance characteristics of our APIs and their respective bottlenecks. In our case, the bottleneck is usually CPU usage, so our target is to ramp up the load until we reach 70% CPU usage across the board, matching our utilization when running at scale.
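Once the performance curve is known, choosing the load level is simple arithmetic. The sketch below assumes the per-thread CPU cost measured in the stress tests and a roughly linear relationship between thread count and server CPU (an approximation that holds until the server saturates); both assumptions are illustrative, not PlayFab's actual model.

```python
def threads_for_target_cpu(cpu_pct_per_thread, target_cpu_pct=70.0, max_threads=1000):
    """Estimate how many load-generator threads are needed to drive the
    API server to the target CPU utilization, assuming server CPU scales
    roughly linearly with the number of generator threads."""
    if cpu_pct_per_thread <= 0:
        raise ValueError("per-thread CPU cost must be positive")
    return min(max_threads, max(1, round(target_cpu_pct / cpu_pct_per_thread)))
```

For an API whose stress test shows each generator thread costs the server about 3.5% CPU, this suggests ramping to roughly 20 threads to hold the 70% target.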
Using AWS Detailed Billing
Since our ultimate goal is to figure out the AWS cost of a single API call, we need to set up the test in such a way that we can derive that information. With a myriad of AWS components and pricing models, that is not an easy task. One important trick we use is to set up a clean and separate AWS account. By then using the LinkedAccountId field in the detailed billing report, we are able to include all costs incurred during the test period, such as DynamoDB access cost, storage cost, load balancer cost, CloudWatch cost, EC2 data transfer cost, etc. This way, we guarantee that our cost calculation is as accurate as possible, while saving the hassle of having to monitor dozens of different pricing models.
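Extracting the test account's spend from the report then reduces to filtering and summing rows. A rough sketch, assuming the CSV form of the detailed billing report with its `LinkedAccountId`, `UsageStartDate`, and `UnBlendedCost` columns (column names per the legacy report format; verify against your own report):

```python
import csv
import io

def hourly_test_cost(billing_csv, linked_account_id, hour_prefix):
    """Sum UnBlendedCost for all line items billed to the test account
    whose UsageStartDate falls in the given hour, matched by string
    prefix (e.g. '2016-03-01 01')."""
    total = 0.0
    for row in csv.DictReader(io.StringIO(billing_csv)):
        if (row["LinkedAccountId"] == linked_account_id
                and row["UsageStartDate"].startswith(hour_prefix)):
            total += float(row["UnBlendedCost"] or 0.0)
    return total
```

Because the separate account isolates the test, this single sum already captures DynamoDB, storage, load balancer, CloudWatch, and data transfer charges without tracking each pricing model individually.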
However, there is one downside to this technique. The finest granularity of the detailed billing report is one hour. That means we have to start tests at the beginning of the hour and stop at the end of the hour, with each API load test running for at least one hour. In addition, DynamoDB charges on provisioned capacity, which requires us to know the capacity usage of each DynamoDB table beforehand. Instead of deriving the number from previous load tests, we use Dynamic-DynamoDB (or a similar tool) to auto-scale the capacity beforehand. This keeps us adaptive to changes in API internal logic. The trade-off is that tests need extra time to warm up, allowing Dynamic-DynamoDB to finish its auto-scaling.
Running the tests: automate, automate, automate
To run load tests effectively on a regular schedule, we had to automate everything. We use Jenkins as the scheduler to run through the steps:
As depicted in the diagram, assume we start at midnight 00:00. Test time for each API is two hours. The first hour is needed for setup: ramping up the desired load and triggering Dynamic-DynamoDB to auto-adjust DynamoDB throughput. (It may not actually take the whole hour, but we need to start the actual test at the start of the next whole hour.) The second hour is the stable test hour, when we hold everything steady and let it run for a complete hour on the clock (e.g. from 01:00 to 02:00), so that we can get an accurate dollar cost from the AWS detailed billing report.
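Aligning the stable hour to the clock is the one scheduling detail the automation must get right, since the billing report only resolves to whole hours. A minimal helper for that alignment (illustrative; a Jenkins pipeline would wrap this in its own scheduling primitives):

```python
import datetime

def seconds_until_next_hour(now=None):
    """Seconds to sleep so that the stable test hour starts exactly on
    the clock hour, matching the one-hour granularity of the AWS
    detailed billing report."""
    now = now or datetime.datetime.now()
    next_hour = (now.replace(minute=0, second=0, microsecond=0)
                 + datetime.timedelta(hours=1))
    return (next_hour - now).total_seconds()
```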
These are the numbers we collect for each API during its stable hour run:
- The cost of AWS, excluding the DynamoDB provisioned capacity (CostAWS)
- The cost of the DynamoDB provisioned capacity (CostDB)
- The number of API queries served during the test hour (# of queries)
- The AWS DynamoDB provisioned capacity (ProvisionDB)
- The actual AWS DynamoDB capacity used (ActualDB)
- The average number of queries per second (QPS)
- The error count
The last two numbers aren't used in our formula but are used to make sure the tests are set up correctly. A high error count or unexpected QPS means the test was likely not set up correctly, and would need to be rerun to get accurate results.
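That sanity check can be expressed as a small gate the automation applies before trusting a run's numbers. The tolerance and error threshold below are illustrative placeholders, not PlayFab's actual values:

```python
def run_is_valid(qps, expected_qps, error_count, qps_tolerance=0.1, max_errors=0):
    """Accept a test run only if the observed QPS is within the given
    fractional tolerance of the expected rate and the error count does
    not exceed the threshold; otherwise the run should be redone."""
    if error_count > max_errors:
        return False
    return abs(qps - expected_qps) <= qps_tolerance * expected_qps
```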
Calculating our costs: The formula
The formula below is the sum of our actual costs for the test hour divided by the number of queries served in that hour. It gets slightly complicated so that we can prorate the cost of our actual Dynamo usage against our provisioned capacity, but is otherwise fairly straightforward:
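The original formula image is not reproduced here, but from the description and the variables collected above, a plausible reconstruction is: non-DynamoDB AWS costs, plus the provisioned-capacity cost prorated by actual usage, divided by queries served.

```python
def cost_per_query(cost_aws, cost_db, actual_db, provision_db, num_queries):
    """Per-query cost for the stable test hour. CostDB is prorated by
    the ratio of actual to provisioned DynamoDB capacity, so over-
    provisioned headroom is not charged against the API under test.
    Reconstructed from the surrounding description; assumptions, not
    the verbatim published formula."""
    prorated_db = cost_db * (actual_db / provision_db)
    return (cost_aws + prorated_db) / num_queries
```

For example, $1.00 of general AWS cost plus $2.00 of DynamoDB capacity used at 50% over 1,000,000 queries works out to $2.00 per million queries.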
With our automated tests and this formula, we can now precisely estimate and then validate the costs of any new or changed API we expose to customers. We use a dashboard to make this immediately visible to both executives and engineers. In turn, the team responds with business and engineering decisions to improve the product, forming a closed-loop agile process that greatly increases our efficiency while lowering the risks of innovation.
Putting the Results to Use
Here are two examples of how we use this now:
Say we introduce an awesome new feature for PlayFab that uses dozens of new and updated APIs. With automated load testing, we can quickly confirm if the cost of the new feature meets expectations, which helps the business team make the most informed decision. If the cost turns out to be higher than our pricing, for example, the engineering team may put more effort into improving efficiency of the feature, or the management team may decide to adjust the pricing to bridge the gap. On the other hand, if the cost is negligible, we can lower or eliminate the costs to our customers.
Like all engineering teams, we try to strike a balance between cranking out new features and cleaning up technical debt for a sustainable architecture. However, because shipping new features always grabs more mindshare, cleaning up technical debt, the so-called dark matter, is often overlooked. The before-and-after results of the automated load tests give us an easy way to see how this engineering effort improves efficiency and thus our bottom line. This in turn helps morale, as engineers are much happier when all their work, not just the flashy stuff, gets recognized and rewarded.
Having all this data up on a centralized dashboard also means that the whole team is now constantly aware of our cost structure, and any changes to it. This builds transparency and communication around decision making, and gives us the confidence to adjust our pricing model and features because we have real data backing up these moves. It also means that the engineering team is more aligned with business goals, and that critical infrastructure work is recognized for its value.
Plus, we really like the cool Minority Report-styled TV wall full of dashboards in our office.