Google today is announcing that it’s open-sourcing pre-trained models for parsing text in 40 languages. Think of it as an extension of Google’s decision in May to open-source the interestingly named Parsey McParseface English-language parser. The new parsers are available on GitHub under an open-source Apache license.
Parsing language might not sound like a big deal — it involves looking at a sentence and picking out the nouns, verbs, adjectives, and so on. But Parsey McParseface works at Google scale: it is accurate enough for machines to rely on it to understand users' web search queries. Researchers can now take advantage of the technology in more languages without worrying about where they'll get the data for training the models.
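To make that concrete, here is a minimal sketch, in plain Python with no parser required, of the kind of per-token analysis a dependency parser like Parsey McParseface produces. The sentence, tags, and labels below are illustrative conventions chosen for this example, not verbatim SyntaxNet output:

```python
# Illustrative only: the kind of per-token analysis a dependency parser
# produces. Tags loosely follow Penn Treebank / Stanford-dependency
# conventions; this is a hand-written example, not real parser output.
sentence = [
    # (token, part-of-speech, 1-based index of head token, dependency label)
    ("Bob",     "NNP", 2, "nsubj"),  # proper noun, subject of "brought"
    ("brought", "VBD", 0, "ROOT"),   # past-tense verb, root of the sentence
    ("the",     "DT",  4, "det"),    # determiner attached to "pizza"
    ("pizza",   "NN",  2, "dobj"),   # noun, direct object of "brought"
    ("to",      "IN",  6, "case"),   # preposition introducing "Alice"
    ("Alice",   "NNP", 2, "nmod"),   # proper noun, oblique argument
]

for i, (token, pos, head, label) in enumerate(sentence, start=1):
    head_word = sentence[head - 1][0] if head else "ROOT"
    print(f"{i}\t{token}\t{pos}\t{label} -> {head_word}")
```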
Googlers didn't stop at making Parsey McParseface fluent in more of the world's tongues. They've also strengthened the underlying SyntaxNet natural language understanding library: it can now split apart and analyze words that are written run together, as they are in Chinese, and it can detect the different meanings signaled by changes in a word's form, better known as morphology.
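For a rough sense of what segmentation involves, here is a toy sketch: a greedy longest-match splitter over a tiny hand-built lexicon. SyntaxNet itself learns segmentation with a neural model rather than a dictionary lookup, so everything in this snippet is an illustrative assumption:

```python
# Toy greedy longest-match segmenter, illustrating the problem of splitting
# text written without spaces. SyntaxNet uses a learned neural model, not
# a dictionary lookup like this.
VOCAB = {"我", "喜欢", "喜", "欢", "吃", "苹果", "苹", "果"}  # tiny hand-built lexicon

def segment(text: str) -> list:
    """Split `text` into words by always taking the longest lexicon match."""
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest span first
            if text[i:j] in VOCAB:
                words.append(text[i:j])
                i = j
                break
        else:  # no match: emit the single character and move on
            words.append(text[i])
            i += 1
    return words

print(segment("我喜欢吃苹果"))  # -> ['我', '喜欢', '吃', '苹果'] ("I like to eat apples")
```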
SyntaxNet is part of TensorFlow, Google’s open-source framework for deep learning. A type of artificial intelligence, deep learning involves training artificial neural networks on great quantities of data, such as strings of words, and getting them to make inferences about new data. Amazon, Facebook, Microsoft, and Twitter, among others, have open-sourced their own tools for deep learning.
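As a general illustration of that train-then-infer pattern (not SyntaxNet itself), here is a minimal sketch using TensorFlow's high-level Keras API, with a toy dataset and a deliberately tiny network:

```python
# Minimal sketch of the deep-learning workflow described above: train a
# small neural network on example data, then make inferences on new data.
import numpy as np
import tensorflow as tf

# Toy dataset: the label is 1 when the sum of the four features is positive.
x_train = np.random.randn(1000, 4).astype("float32")
y_train = (x_train.sum(axis=1) > 0).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, verbose=0)

# Inference on unseen inputs: the network generalizes from its training data.
x_new = np.random.randn(3, 4).astype("float32")
print(model.predict(x_new, verbose=0))
```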
In a blog post today, Chris Alberti, Dave Orr, and Slav Petrov of Google’s natural language understanding team explain why it’s been hard for independent researchers to make Parsey McParseface work well in many languages other than English:
The reason for that is a little bit subtle. SyntaxNet, like other TensorFlow models, has a lot of knobs to turn, which affect accuracy and speed. These knobs are called hyperparameters, and control things like the learning rate and its decay, momentum, and random initialization. Because neural networks are more sensitive to the choice of these hyperparameters than many other machine learning algorithms, picking the right hyperparameter setting is very important. Unfortunately, there is no tested and proven way of doing this and picking good hyperparameters is mostly an empirical science — we try a bunch of settings and see what works best.
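The "try a bunch of settings and see what works best" approach the team describes can be sketched as a simple grid search. In the snippet below, train_and_evaluate is a hypothetical stand-in for a real training run; it would train a model with the given settings and return a validation score:

```python
# Sketch of the empirical hyperparameter search described above: try every
# combination in a grid, train with each, keep whichever validates best.
import itertools
import random

def train_and_evaluate(learning_rate, decay, momentum, seed):
    """Hypothetical stand-in: train with these settings, return dev accuracy."""
    random.seed(seed)
    return random.random()  # placeholder for a real validation score

grid = {
    "learning_rate": [0.1, 0.01, 0.001],
    "decay": [0.9, 0.95],
    "momentum": [0.0, 0.9],
    "seed": [0, 1],  # random initialization also affects the outcome
}

best_score, best_config = -1.0, None
for values in itertools.product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    score = train_and_evaluate(**config)
    if score > best_score:
        best_score, best_config = score, config

print(f"best setting: {best_config} (dev score {best_score:.3f})")
```

In practice the grid grows combinatorially with each extra knob, which is part of why tuning a parser for every new language is so costly, and why releasing pre-trained models saves researchers that work.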
For more detail on today’s release, see the rest of Alberti, Orr, and Petrov’s blog post.