Data scientists need their own GitHub. Here are four of the best options

Jordan Novet @jordannovet April 1, 2014 11:47 AM

Imagine if a company’s three highly valued data scientists could happily work together without duplicating each other’s efforts and could easily call up the ingredients and results of each other’s previous work.

That day has come.

As the data scientist arms race continues, data scientists might want to join forces. Crazy idea, right?

Two San Francisco startups — Domino Data Lab and Sense — have emerged recently with software to let data scientists collaborate on multiple projects. In a way, it’s like code storehouse GitHub for the data science world. A Montreal startup named Plot.ly has been talking about the same themes, but it brings a more social twist.

Another startup, Mode Analytics, is building software for data analysts to ask questions of data without duplicating previous efforts. And at least one more mature software vendor, Alpine Data Labs, has been adding features to help many colleagues in a company apply algorithms to code on one central hub.

And this isn’t just technology for technology’s sake. People are using the software.

“Domino really makes it easy to share the building of an application, the building of data models, and doing data analysis itself,” Jonathan Dinu, a co-founder of data science school Zipfian Academy, said in an interview with VentureBeat.

Software from Domino and the other startups will look different and offer different features, but they all share a common vision. They all want to usher in a day of collaborative, not isolated, exploration of data.

It’s not as though these startups are devising completely new capabilities. For business analysts who don’t write code, software from vendors like Alteryx, Alpine, and Dataiku is useful, wrote Ben Lorica, the chief data scientist at O’Reilly Media, in an email to VentureBeat. And for those hardcore data scientists fluent in code, the IPython notebooks have become popular. But these new tools could ride the wave of the ascending age of data science.

While the technology might not score mass adoption right this minute, some investors believe it’s poised to become more popular, because it solves problems. The founders certainly believe in the advantages of their technologies, too.

Minimizing duplicative work

“It’s very hard to actually collaborate in the world of data analysis,” Derek Steer, Mode’s chief executive, said in an interview with VentureBeat. “It turns out that people do a lot of the same things over and over.” Steer and his team want to stop that from happening.

Steer knows the annoyance well. He and Mode cofounders Benn Stancil and Josh Ferguson worked in data analysis at Yammer and stuck around after Microsoft bought it in 2012. They left in August to start Mode, and in recent months, they have pulled in seed funding from current and former Yammer executives.

The value of Mode will come in minimizing duplicative work. Steer has explained this to us in the past:

“If you’re writing SQL [query language] queries, for example, you’re going to churn out 20 or 30 queries on your way to your final one,” Steer said. “You might repeat that process again 10 more times today. You’ve just got this incredible amount of information and you didn’t stop to add titles in front of it or a description. Part of what Mode does is it separates things out: here’s the most valuable piece of what you’ve done. Next time someone starts working on this problem, we’re going to serve that up directly to them.”

The product is still under wraps. Steer assured me that Mode’s software will be different from what’s available from other startups. But regardless of whether that proves out, he and his contemporaries building different projects are talking about similar problems. They can save time and make better use of computing resources.

But the perks extend beyond that. Helping multiple colleagues to work together means the company can benefit from bringing multiple sagacious perspectives to bear on the data. Employed data scientists don’t necessarily have one common knowledge base. Some come from scientific backgrounds. Others might pack deep knowledge of their companies’ specific industries. Still others might be developers who come with statistical skills. With collaboration tools, you can let all those people look at data and draw their own conclusions.

The power of version control

That’s certainly one thing the founders of Domino Data Lab are aiming to change.

Nick Elprin, Matthew Granade, and Chris Yang started Domino after doing large-scale data analysis and developing new economic models in the research department of Bridgewater Associates, a hedge fund in Connecticut, Elprin said in an interview with VentureBeat.

The group stayed close after leaving the company in 2012 and concluded that because data science would become more important in the future, they should build a startup to simplify the process. They talked to data scientists at universities and startups. Some said it was difficult to conduct data analysis on their own individual laptops, and connecting up to a shared cluster was complicated. Others said collaborating with fellow data scientists was a pain.

Elprin and Yang are software engineers by training, and one thing that’s dramatically accelerated their craft is version control — the ability to see the changes to code that developers make and go back to a previous version if something goes wrong with new code.

Above: Nick Elprin of Domino Data Lab.

Image Credit: Jordan Novet/VentureBeat

“A lot of the analysts and data scientists we know don’t do version control,” Elprin said. “If you ask them what do you do if you want to work on different versions of an experiment with different parameters, they say, ‘Well, I make a copy of my files. I have scripts and script [copies].’ How do you ensure you can get back to your old versions of your work? … You use version control.”

But because Domino applies those scripts to data sets, the company’s software remembers which input data was used with which scripts. That’s because the company wants to make sure its users can always reproduce the results they come up with.

“If you start by just letting people have the exact version of the source code and the input data that produced a certain result, that actually gets you a lot of the way there,” Elprin said.

The startup goes one step further and stores information on which libraries of algorithms data scientists put to work.
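The idea Elprin describes, tying a result to the exact code, input data, and libraries that produced it, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Domino’s implementation: the function names (`make_run_record`, `same_inputs`) and the record layout are hypothetical, and a real system would also store the files themselves rather than just their fingerprints.

```python
import hashlib
import sys
from importlib import metadata
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def library_versions(names):
    """Map each package name to its installed version (None if not installed)."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def make_run_record(script_path, data_path, libraries=()):
    """Build a hypothetical 'run record' describing what produced a result.

    The record fingerprints the script and the input data, and notes the
    interpreter and library versions in use when the run happened.
    """
    return {
        "script_sha256": sha256_of(script_path),
        "data_sha256": sha256_of(data_path),
        "python": sys.version.split()[0],
        "libraries": library_versions(libraries),
    }


def same_inputs(record_a, record_b):
    """True if two runs used byte-identical script and input data."""
    return (
        record_a["script_sha256"] == record_b["script_sha256"]
        and record_a["data_sha256"] == record_b["data_sha256"]
    )
```

A record like this could be saved next to each result. If a colleague later gets a different answer, comparing two records with `same_inputs` shows quickly whether the script or the data changed between runs.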

Some people using Domino now have previously tried storing their code in GitHub, Elprin said, but they ran into issues. They found it too hard to learn quickly, or they didn’t like how it couldn’t manage large files, or they weren’t able to see the output of their code, he said.

The startup hasn’t yet announced prices after opening up the public beta a few months ago, although it has signed up its first enterprise customer, Elprin wrote in an email. The customer is running its analytical models on Domino to better target marketing campaigns, and it no longer has to deal with the complexity of maintaining virtual resources on the Amazon Web Services public cloud, Elprin wrote.

That’s certainly one advantage of managed tools like Domino — and it also applies to Sense. But it’s not only that.

Dropbox doesn’t cut it

“The question that all companies I talk to are facing is how do we do more, and how do we do it quicker,” Sense co-founder and chief executive Tristan Zajonc said in an interview with VentureBeat. “How do we get more value from data, and how do we get that value faster — so, with less resources?”

Sense started after Zajonc and co-founder Anand Patil, who both have extensive training in statistics, realized it’s wise for companies to use the cloud to run lots of models at once, but sharing visualizations and other findings ain’t easy. So they needed a central cloud-based place for that to happen. After getting to know each other in the context of statistics research, they started building a “next-generation version of R” — a language and line of software for performing statistics — and they figured out they would also need to build “a whole platform and user experience,” Zajonc said.

That’s Sense.

“Dropbox is good for sharing files, but it’s not good for sharing actual results,” Zajonc said. “It’s just very awkward to share the output of your files. It’s hard to scale your analysis.”

Above: Tristan Zajonc of Sense.

Image Credit: Jordan Novet/VentureBeat

Even if it makes sense for data scientists to use public-cloud infrastructure, not every company will feel comfortable with people throwing numbers outside trusted corporate data centers. So Sense also makes its software available for on-premises deployments.

Time will tell if Sense ends up posing a threat to statistics software from legacy vendors like SAS. Zajonc, for one, is optimistic.

“We think all data-driven scientists and businesses should be on Sense,” he said.

The social network of data

Of course the Sense team thinks so. At the same time, Plot.ly’s team believes its approach will win lots and lots of users, too. It supports Microsoft Excel spreadsheets, Python scripts, files from MATLAB software, and other tools for manipulating and analyzing data. And it has an API that lets people push in streaming data and update their charts of data in real time.

But while it boasts technical chops and rich visualization capabilities, Plot.ly is less about giving a platform for two data scientists and more about becoming a common meeting place for people interested in data.

“Once you’re graphing data on Plot.ly, it’s discoverable by other people,” co-founder and chief executive Jack Parmer said in an interview with VentureBeat. “You can follow people that have similar interests to you. You can … get updates on their data, and you can write comments on their data. It’s really similar to GitHub — or Instagram, where, instead of a photo, it’s a graph. We really emphasize the community aspect.”

Parmer and fellow co-founder Alex Johnson first met at solar-panel-installation company Alion Energy, where they wrote software to track activity on manufacturing lines. They did similar work at other science-oriented startups, Parmer said, before they decided to “just make one super-platform that would probably work for all of these companies.” That’s how Plot.ly got started.

After starting last year, the startup has racked up users at lots of companies, and a few, like SpaceX, are paying for it. Plot.ly picked up a $1.5 million funding round with contributions from Rho Canada Ventures. In order to bring in revenue while trying to grow its community of free users, the startup is working on a version of its software for on-premises deployments, which could be good for companies that don’t want sensitive data ending up on someone else’s servers.

“We don’t expect to be profitable within a year,” Parmer said. “What we really want to focus on is being the visualization and the version-control layer for all these data sources and data tools as fast as possible.”

There are only so many data scientists

The challenge facing startups like Sense and Domino especially is that they don’t have the potential to sign up everyone as a user the way a social network, an email client, or a marketing-automation tool might be able to. They’re cut out for people with specific know-how. And data science is still nascent. So at this point, the market size isn’t huge. Which means some VCs might be slow to invest. Still, these types of software could hit their stride in the years to come, as data scientists become more common.

“I’m a believer that data collaboration is, you know, one of the missing links, and one of the important things to solve in this new big data era,” Ping Li, a general partner at Accel Partners, said in an interview with VentureBeat.

“Once people start doing data analysis, they need a better platform to share and collaborate on data analysis.”

If anything, investors will favor the products that are accessible to as many people as possible — and elegant, at that.

“It will take some time for iteration to kind of get there, but … I think it’s a pain point, and I think it will become a bigger pain point over time,” Li said.
