How Netflix keeps its cloud strong by using virtual 'monkeys' to unplug servers at random

Dylan Tweney @dylan20 September 10, 2013 5:44 PM

Ariel Tseitlin, the director of cloud solutions at Netflix, onstage at CloudBeat 2013.

Image Credit: Michael O'Donnell/VentureBeat

SAN FRANCISCO — What does not kill me, makes me stronger. So said Nietzsche, Conan the Barbarian, and Kelly Clarkson.

Now Netflix cloud director Ariel Tseitlin is taking that philosophy to its natural limit in the world of the cloud. Every day, he unleashes an army of virtual monkeys on his company’s computing infrastructure, trying to kill it. Every day, it survives — and it gets stronger, more resilient, and more resistant to real outages. By now, it is almost unkillable.

As a result, Netflix has managed to stay online even while other users of Amazon’s cloud system have gone offline.

“Their sole purpose is to make sure that we’re failing in a consistent and frequent enough way to make sure that we don’t drift into overall failure,” Tseitlin said of the monkeys today at CloudBeat 2013, VentureBeat’s conference on the enterprise cloud.

Netflix is perhaps the highest-profile consumer of cloud services, and it’s certainly one of the largest. It uses Amazon Web Services to host its streaming-media business, which by some measures accounts for one-third of the data streaming into American households.

Tseitlin uses a suite of software that he wryly terms a “simian army” to go around and wreak mayhem in Netflix’s virtual data center.

Each “monkey” does a different thing: not just simulating failure, but actually making things go wrong in Netflix’s production environment. For instance, the “chaos monkey” works like this: Every weekday, at a random time between 9 a.m. and 5 p.m., it randomly scans the production environment, rolls the dice, and picks some real production instance to terminate.
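The selection logic described here can be sketched in a few lines of Python. This is only an illustration of the idea, not Netflix's actual code: the function names (`in_business_hours`, `random_strike_time`, `pick_victim`) and the instance IDs are hypothetical, and a real tool would call the cloud provider's API to terminate the chosen instance rather than just returning its name.

```python
import random
from datetime import date, datetime, time, timedelta

def in_business_hours(now: datetime) -> bool:
    """True on Monday-Friday between 9 a.m. and 5 p.m. (assumed local time)."""
    return now.weekday() < 5 and time(9, 0) <= now.time() < time(17, 0)

def random_strike_time(day: date, rng: random.Random) -> datetime:
    """Pick a random moment between 9 a.m. and 5 p.m. on the given day."""
    offset = timedelta(seconds=rng.randrange(8 * 3600))
    return datetime.combine(day, time(9, 0)) + offset

def pick_victim(instances, now: datetime, rng: random.Random):
    """Roll the dice and return one instance to terminate, or None
    if it is outside business hours or there is nothing to pick."""
    if not instances or not in_business_hours(now):
        return None
    return rng.choice(list(instances))

if __name__ == "__main__":
    rng = random.Random()
    fleet = ["i-aaa", "i-bbb", "i-ccc"]  # hypothetical instance IDs
    when = random_strike_time(date(2013, 9, 10), rng)
    print(when, pick_victim(fleet, when, rng))
```

The business-hours guard is the part that matters most in this sketch: the strike only happens when someone is at a desk to respond.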

“The design premise there is that all of the architecture is resilient enough to retry and to begin re-serving the experience in a way that is completely transparent to the customer,” Tseitlin said. “You as the viewer should have no idea that the instance that was serving up your movie was just terminated.”
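As a rough illustration of that premise, a client can simply move on to another instance when the one serving it disappears. The sketch below is an assumption about how such a retry might look, not a description of Netflix's software; `InstanceTerminated`, `serve_with_retry`, and the two toy instances are all hypothetical names.

```python
class InstanceTerminated(Exception):
    """Raised when the instance handling a request has gone away."""

def serve_with_retry(instances, request):
    """Try each instance in turn and return the first successful response."""
    last_error = None
    for instance in instances:
        try:
            return instance(request)
        except InstanceTerminated as err:
            last_error = err  # this one was killed; try the next
    raise RuntimeError("no instance could serve the request") from last_error

def terminated_instance(request):
    # Stands in for an instance a chaos monkey has just terminated.
    raise InstanceTerminated("instance was terminated")

def healthy_instance(request):
    return f"streaming {request}"

if __name__ == "__main__":
    # The viewer only ever sees the successful response.
    print(serve_with_retry([terminated_instance, healthy_instance], "movie-42"))
```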

Why only on weekdays? The idea is to uncover problems with the infrastructure when there are engineers on duty to fix them, if something really does go wrong. By uncovering these problems when engineers are on call, Netflix saves itself the grief of having to solve them at 3 a.m. on a Sunday.

Of course, that only works because Netflix already has a remarkably strong cloud infrastructure. It got that way, Tseitlin said, after a major outage the company experienced in 2008, when it was still primarily a DVD-by-mail business and was running all its IT in its own data center. The outage, which originated in the company’s Oracle databases, took almost a week to correct.

“If we had such an outage while we were a streaming company, this could have been an existential threat to us,” Tseitlin said. “That made us think about what we needed to do to build a resilient service.”

So Netflix set out to rebuild everything, adopting a more modern architecture and moving its systems into the cloud (using Amazon Web Services).

Why Amazon? “We knew we didn’t want to build it,” Tseitlin said. “We looked around at the options out there, and Amazon was the only one that had the scale we needed and the feature breadth that we wanted.”

As part of the project, the company built resiliency into its architecture throughout, following two main design principles: isolation and redundancy. Isolation means that a failure in one component can’t bring down the entire edifice, while redundancy means that every component is backed up by an alternative in case it fails.
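The two principles can be shown together in a toy example. In this sketch, which assumes nothing about Netflix's real architecture, each part of a page has a list of providers (redundancy) and a fallback value that is used if every provider fails, so one broken part cannot take down the whole page (isolation). The names `render_page`, `video`, and `recommendations` are hypothetical.

```python
def render_page(components):
    """Build a page from independent components.

    `components` maps a name to (providers, fallback). Providers are tried
    in order; if all of them fail, the fallback is used instead of raising.
    """
    page = {}
    for name, (providers, fallback) in components.items():
        page[name] = fallback  # isolation: a failure stays inside this slot
        for provider in providers:  # redundancy: try each backup in turn
            try:
                page[name] = provider()
                break
            except Exception:
                continue
    return page

def broken():
    raise RuntimeError("component is down")

if __name__ == "__main__":
    page = render_page({
        "video": ([broken, lambda: "stream"], None),
        "recommendations": ([broken], []),
    })
    print(page)
```

Here the video still plays from its backup provider, and the recommendations row degrades to an empty list instead of breaking the page.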

It might sound expensive to build two (or more) of everything, but Tseitlin says it’s really about allocating expenses more rationally, and that it actually results in overall savings. Netflix may be spending more money upfront, but it is saving money by avoiding last-minute panic situations.

“By doing this, you’re not using in aggregate a larger amount of resources, but the way you’re distributing them makes it possible for you to be more resilient in the long term,” Tseitlin said.

Netflix not only uses Amazon Web Services, it has also open-sourced much of its additional software through a project called Netflix OSS. Tseitlin noted that there is currently a $100,000 prize on offer for the best contribution toward that project. You have until Sept. 15 to make your contributions, so developers: Get coding.

Maybe you could make the next chaos monkey to rampage through Netflix’s virtual data center.
