Facebook releases 1.6GB of children’s stories for training its AI

Jordan Novet @jordannovet February 18, 2016 1:19 PM

Facebook today announced that it has released the data it used to train its artificial intelligence software to understand children’s stories and predict the word missing from a given sentence in a story.

The data set (.tgz) is more than 1.6GB and accompanies a recently published academic paper called “The Goldilocks Principle: Reading Children’s Books with Explicit Memory Representations.” Facebook chief executive Mark Zuckerberg gave an overview of the research today in a Facebook post:

Language is one of the most complex things for computers to understand. Guessing how to complete a sentence is pretty easy for people but much more difficult for machines. Historically, computers have been able to predict simple words like “on” or “at” and verbs like “run” or “eat”, but they don’t do as well at predicting nouns like “ball”, “table” or people’s names.

For this research, our team taught the computer to look at the context of a sentence and much more accurately predict those more difficult words — nouns and names — which are often the most important parts of sentences. The computer’s predictions were most accurate when it looked at just the right amount of context around relevant words — not too much and not too little. We call this “The Goldilocks Principle”.
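The idea of blanking out a word and looking at a limited amount of surrounding context can be sketched in a few lines of Python. This is only an illustration of the task setup, not Facebook's model or the data set's actual file format: the function names, the "XXXXX" placeholder, and the example sentence are all assumptions made for this sketch.

```python
# Illustrative sketch only: names and the blank token are assumptions,
# not taken from Facebook's code or data set.

def make_cloze(sentence, target, blank="XXXXX"):
    """Replace the first occurrence of `target` with a blank placeholder.

    Returns the list of words and the position of the blank.
    """
    words = sentence.split()
    idx = words.index(target)  # raises ValueError if target is absent
    words[idx] = blank
    return words, idx


def context_window(words, idx, size):
    """Return up to `size` words on each side of position `idx`."""
    left = words[max(0, idx - size):idx]
    right = words[idx + 1:idx + 1 + size]
    return left, right


words, idx = make_cloze("the girl threw the red ball over the fence", "ball")
left, right = context_window(words, idx, 2)
print(" ".join(words))
print(left, right)
```

Changing `size` changes how much context a predictor would see around the blank: a very small window may leave out the clue that identifies the noun, while a very large one adds words that have little to do with it.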

Now the data set, which draws from books available through the volunteer-led Project Gutenberg, is accessible to academic researchers and to researchers at other companies who want to improve language understanding systems for their applications.

Facebook has previously open-sourced some of its artificial intelligence source code — as have other major web companies — and even shared designs for its artificial intelligence servers. Data releases are another way for Facebook to share its tooling to advance research.

Yahoo, another company that engages in artificial intelligence research, recently released a 13TB dataset that can be used for machine learning research, but it’s only available to people affiliated with academic institutions.

More information on Facebook Artificial Intelligence Research’s “Children’s Book Test” is here.
