Google and Microsoft have tied for first place in the Microsoft Common Objects in Context (COCO) Captioning Challenge, a competition for automatically generating captions for images. The results will be formally announced on Friday at the CVPR computer-vision conference in Boston. (Update: Microsoft issued a blog post on the news on June 11, a couple of days after we published this post.)
The technology from Google, described in a recent paper titled “Show and Tell: A Neural Image Caption Generator,” performed just as well as two separate Microsoft systems. One is described in the paper “From Captions to Visual Concepts and Back” and the other in the paper “Language Models for Image Captioning: The Quirks and What Works.” Technology from researchers at the University of Montreal and the University of Toronto also tied for first place. The competition involved categorizing several objects in hundreds of thousands of images and then writing multiple captions for every single image.
Researchers from Baidu, who worked with people at the University of California, Los Angeles, received a lower ranking in the competition.
Judges based the rankings on two measures: the percentage of captions that were at least as good as human-written captions, and the percentage of captions that passed the Turing Test, meaning they could not be told apart from captions written by a person.
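To make those two percentages concrete, here is a minimal Python sketch of how such scores could be tallied from per-caption verdicts. The `Judgment` class, the `caption_scores` function, and the sample verdicts are all hypothetical and invented for illustration. This is not the challenge's actual scoring code, and it does not show how the organizers combined the two numbers into a final ranking.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One judge's verdict on one machine-generated caption (hypothetical format)."""
    at_least_as_good_as_human: bool  # caption judged equal to or better than a human one
    passed_turing_test: bool         # caption could not be told apart from a human one


def caption_scores(judgments):
    """Return (percent at least as good as human, percent passing the Turing Test)."""
    if not judgments:
        raise ValueError("need at least one judgment")
    total = len(judgments)
    # Booleans count as 0 or 1, so sum() gives the number of True verdicts.
    good = sum(j.at_least_as_good_as_human for j in judgments)
    passed = sum(j.passed_turing_test for j in judgments)
    return 100.0 * good / total, 100.0 * passed / total


# Made-up verdicts for four captions, for illustration only.
sample = [
    Judgment(True, True),
    Judgment(False, True),
    Judgment(False, False),
    Judgment(False, True),
]

pct_good, pct_turing = caption_scores(sample)
print(f"At least as good as human: {pct_good:.1f}%")
print(f"Passed the Turing Test: {pct_turing:.1f}%")
```

With these made-up verdicts, one caption in four is rated at least as good as a human caption and three in four pass the Turing Test, so the sketch reports 25.0% and 75.0%.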
The competition is one of many for people working on image recognition systems. But this is the latest opportunity for Google to boast about its capabilities when it comes to analyzing both words and images at scale.
To perform so well in the competition, Google and Microsoft researchers employed a type of artificial intelligence called deep learning. It involves training systems called artificial neural networks on lots of data, like pictures, and then presenting them with a new piece of data so they can make an inference about it. Deep learning works behind the scenes for many consumer-facing web applications, including the new Google Photos service.
Google and Microsoft are constantly improving their deep learning technology, as are several other companies, like Facebook and Baidu.
Impressing talent is key at this point, with deep learning in vogue, so if nothing else, Google and Microsoft have succeeded in showing they do not lag behind other companies or academic teams.
To get a sense of what Microsoft’s cutting-edge image-captioning technology can do, check out this demo. Like Microsoft’s face-recognition technology, it isn’t perfect, but it isn’t all that bad.