Another day, another story about an online game that failed at launch due to scalability problems. This time it's Evolution Studios' racing sim Driveclub, but it's hardly the first game to go down on launch day. These issues aren't even limited to games. Remember the Healthcare.gov debacle?
Why does this keep happening? Because, frankly, building systems that scale is really, really hard, and it’s something schools almost never teach.
Designing and building a large-scale, data-intensive backend is no small undertaking, and it requires a fundamentally different set of skills than building a traditional single-player game. Many things that work fine at low scale can come crashing down as soon as traffic increases. Getting it right typically requires a lot of insight into precisely what's going on under the hood. Libraries and frameworks, for example, may be making assumptions that fail under load.
Ultimately, there is no substitute for realistic load testing, but too often games go live with minimal scale testing because teams either don't want to spend the money or simply run out of time to test adequately. Simulating the load from millions of players can be expensive: a single load test performed at casual game developer PopCap for Plants vs. Zombies 2 cost nearly $100,000 in temporary server instances leased from Amazon Web Services. But it was worth it. PvZ 2 was one of the rare games that saw massive load at launch yet suffered no outages, and that was only because the developer found and fixed many issues during testing.
Some factors might seem unrelated, but they can in fact have serious performance impacts on each other. For example, if your server is processing a lot of data and is blocked on I/O operations, spawning another process to handle the additional load just gives you another process blocked on the same I/O, making your situation worse.
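Here's a contrived sketch of that trap (Python threads, with a semaphore standing in for a saturated disk or database; the numbers are purely illustrative). Adding workers just adds more blocked threads without adding throughput:

```python
import threading
import time

# Model a saturated I/O device (disk or database) that can only serve
# a few requests at a time, no matter how many workers ask.
IO_CAPACITY = threading.Semaphore(4)

def slow_io_read():
    with IO_CAPACITY:      # queue up behind the saturated device
        time.sleep(0.25)   # stand-in for the actual read

def handle_request():
    slow_io_read()

# Naive reaction to rising load: spawn more workers. They all end up
# blocked on the same I/O bottleneck, so extra workers add memory and
# scheduling overhead without adding throughput.
workers = [threading.Thread(target=handle_request) for _ in range(40)]
start = time.time()
for w in workers:
    w.start()
for w in workers:
    w.join()
print(f"40 requests with 40 workers: {time.time() - start:.1f}s")
print("Throughput is still capped by the I/O device, not by worker count.")
```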
Sometimes, modern techniques meant to fix one set of problems can cause another. Most modern systems feature nonblocking, asynchronous function calls. These are great because your code won't freeze up waiting for a call to complete. But every call that doesn't return immediately still consumes server resources while it waits for the eventual result. Nonblocking calls can also easily create hard-to-fix "race condition" bugs if you're not familiar with thread safety.
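As a minimal illustration (Python's asyncio here is just a convenient stand-in for any async framework, and the player-balance example is hypothetical), a lost update can sneak in whenever two tasks yield between reading and writing shared state:

```python
import asyncio

balance = 100  # shared state, e.g. a player's currency total

async def spend(amount):
    global balance
    current = balance            # read
    await asyncio.sleep(0)       # nonblocking call yields to other tasks here
    balance = current - amount   # write based on a now-stale read

async def main():
    # Two concurrent spends of 60 each should end at -20 (or one should
    # be rejected), but the race silently loses one of the updates.
    await asyncio.gather(spend(60), spend(60))
    print(f"balance = {balance}")  # prints 40: one spend vanished

asyncio.run(main())
```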
Data storage also presents a daunting set of possible issues. To ensure no information gets lost in the event of a server failure, most data storage servers are set up in redundant clusters where data is automatically copied between machines. This sounds like a good thing, and it generally is. But it can also have unintended consequences.
Has the overhead of copying data between servers been factored into the total available capacity? Do you have a plan for splitting data across new servers as it grows? If a million records are added to a single table, how will that affect performance? Can a table grow so large that the servers run out of space? Does every data update lock out other operations, leading to data gridlock? What about table scans, which can lock a huge portion of the data? And what happens when the user-facing code can't connect to the backend storage system at all? Is the request queued, blocked, or failed?
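To make that last question concrete, here is one hypothetical write path (the names, limits, and simulated failure are all invented for illustration) where the answer is decided up front: retry briefly, then queue, then fail rather than block the caller:

```python
import queue
import random
import time

RETRY_LIMIT = 3
write_queue = queue.Queue(maxsize=10_000)  # bounded, so the backlog can't grow forever

def storage_write(record):
    # Stand-in for the real storage client call, which may time out.
    if random.random() < 0.3:
        raise TimeoutError("storage node unreachable")

def save(record):
    for attempt in range(RETRY_LIMIT):
        try:
            storage_write(record)
            return "written"
        except TimeoutError:
            time.sleep(0.1 * 2 ** attempt)  # exponential backoff between retries
    try:
        write_queue.put_nowait(record)      # park it for a background flusher
        return "queued"
    except queue.Full:
        return "failed"                     # shed load instead of blocking callers

print(save({"player": "abc123", "score": 42}))
```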
These questions have no easy answers, and many more could be posed. For example, what performance metric will you use to trigger bringing new servers online? CPU usage, though seemingly obvious, is rarely the best answer (the right trigger is typically tied more closely to I/O operations).
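A scale-out check might, for instance, look at I/O wait before CPU. Here is a rough sketch using the third-party psutil library (iowait is only reported on Linux, and the thresholds are purely illustrative, not recommendations):

```python
import psutil  # third-party; `pip install psutil`

IOWAIT_THRESHOLD = 20.0   # percent of time CPUs sit waiting on I/O
CPU_THRESHOLD = 85.0

def should_scale_out():
    times = psutil.cpu_times_percent(interval=1.0)
    iowait = getattr(times, "iowait", 0.0)  # only reported on Linux
    # A box stuck in iowait looks "idle" to a CPU-only check, yet it's
    # the one that needs help; check I/O pressure first, CPU second.
    return iowait > IOWAIT_THRESHOLD or (times.user + times.system) > CPU_THRESHOLD

if __name__ == "__main__":
    print("scale out" if should_scale_out() else "capacity OK")
```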
The saddest part of the Driveclub story is the quote that 120 people are working to fix the issues. If true, it means they got slammed by so many gotchas they didn't know existed that the whole system came crashing down. The only reason they would be updating the client is to lower the load and buy themselves breathing room; problems like these can typically only be solved by re-architecting the backend.
Consider Electronic Arts' The Simpsons: Tapped Out. Today, it's one of the most successful mobile games on the market. But when it first launched, it had so many problems that EA had to pull it from the App Store while it re-architected and retested the entire game.
We are very sympathetic to the developers at Evolution Studios; what they are facing right now is every backend engineer's nightmare. We've tried to do everything we can here at PlayFab to avoid similar issues, including massive load testing and careful architecture design, but there's always the danger that we overlooked something, and that's what made reading this article so painful!
Siva Katir is a senior backend engineer at PlayFab, Inc., the first complete Game Operations Platform. With more than a decade of development experience, he is an integral part of the PlayFab team, which provides a reliable and cost-effective suite of backend services and tools for building and operating any live game.