Twitter details its home-brewed framework for simulating hardware failures

People who maintain data centers for companies don’t want to be caught by surprise in the event of a power outage or a hardware breakdown. They want to rest assured that everything will be fine in any eventuality.

Years ago, Netflix engineers developed software called Chaos Monkey, which could artificially create problems with the underlying Amazon Web Services cloud infrastructure that much of the Netflix app depends on. The idea was that engineers could figure out ahead of time how to make the system so strong that it could withstand such problems. Today, Twitter is talking for the first time about a system its programmers have built to simulate hardware failures and reveal their impact on Twitter’s software.

“This allows us to discover vulnerabilities so that we’re better prepared to handle a site-wide incident,” Twitter head of infrastructure and operations Mazdak Hashemi wrote in a blog post. “By causing failures in our own system, we’re able to build more resilient services.”

Twitter regularly builds its own software to meet its own needs. Examples include the open-source tools Scalding, Summingbird, and Diffy. Hashemi doesn’t say whether Twitter will release the unnamed failure-testing framework under an open-source license, but it wouldn’t be surprising to see that happen eventually.

The framework includes three parts: “mischief” modules that create and then undo the failures, “monitor” modules that check whether things are actually going haywire at Twitter (if they are, the tests stop), and “notifier” modules that tell Twitter teams about the tests.
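Twitter has not published the framework's code, so the following is only a rough sketch of how those three roles might fit together. Every name in it (Mischief, FailureTest, run, the checks parameter) is hypothetical and chosen for illustration; none comes from Twitter's blog post.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Mischief:
    """Hypothetical failure injector: knows how to cause a failure and undo it."""
    name: str
    inject: Callable[[], None]
    revert: Callable[[], None]


@dataclass
class FailureTest:
    """Hypothetical test runner tying the three kinds of modules together."""
    mischief: Mischief
    monitors: List[Callable[[], bool]]      # each returns True while the site looks healthy
    notifiers: List[Callable[[str], None]]  # each receives a status message

    def _notify(self, message: str) -> None:
        for notifier in self.notifiers:
            notifier(message)

    def run(self, checks: int = 3) -> bool:
        """Inject the failure, poll the monitors, and always undo the failure.

        Returns True if every check passed, False if a monitor stopped the test.
        """
        self._notify(f"starting {self.mischief.name}")
        self.mischief.inject()
        try:
            for _ in range(checks):
                # Stop the test as soon as any monitor reports real trouble.
                if not all(monitor() for monitor in self.monitors):
                    self._notify(f"aborting {self.mischief.name}: monitor unhealthy")
                    return False
            self._notify(f"completed {self.mischief.name}")
            return True
        finally:
            # The simulated failure is undone whether the test finished or aborted.
            self.mischief.revert()


if __name__ == "__main__":
    power_loss = Mischief(
        name="simulated power loss",
        inject=lambda: print("powering off test rack"),
        revert=lambda: print("powering rack back on"),
    )
    test = FailureTest(power_loss, monitors=[lambda: True], notifiers=[print])
    test.run()
```

The point of the sketch is the ordering: teams are notified first, the failure is injected, the monitors can halt the test at any check, and the failure is reverted no matter how the test ends.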

The framework lets Twitter see what would happen in the event of a power loss, a network loss, or the loss of Twitter software running within a Mesos cluster (think of it as one big group of servers that can run multiple applications).

So far, this technology has come in handy for Twitter engineers.

“This framework has driven all failure testing at Twitter over the past six months and has helped us discover numerous vulnerabilities in our stack,” Hashemi wrote. “In addition, it’s given us confidence in the failure resiliency of several of our primary systems, such as Apache Mesos and Apache Aurora, where we have tested large-scale failures which resulted in no user-facing impact.”

Check out the full blog post for more information on the framework.
