Google ties with Microsoft in Microsoft's own contest for generating image captions

Google and Microsoft have finished in a dead tie for first place in the Microsoft Common Objects in Context (COCO) Captioning Challenge, a contest in automatically generating captions for images. The results will be formally announced on Friday at the CVPR computer-vision conference in Boston. (Update: Microsoft issued a blog post on the news on June 11, a couple of days after we published this post.)

The technology from Google, described in a recent paper entitled “Show and Tell: A Neural Image Caption Generator,” performed just as well as two separate Microsoft systems — one described in the paper “From Captions to Visual Concepts and Back” and the other in the paper “Language Models for Image Captioning: The Quirks and What Works.” Technology from researchers at the University of Montreal and the University of Toronto also tied for first place in the competition, which involved categorizing several objects in hundreds of thousands of images and then writing multiple captions for every single image.

Researchers from Baidu who worked with people at the University of California, Los Angeles received a lower ranking in the competition.

Judges came up with the rankings based on the percentage of captions that were at least as good as, if not better than, human captions, and the percentage of captions that passed the Turing Test.
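To make those two measures concrete, here is a minimal sketch of how such percentages could be tallied from human judgments. The `Judgment` record, its field names, and the sample data are all hypothetical and chosen only for illustration; the challenge's actual scoring procedure is not described in this article.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    # Hypothetical record of one judge's verdict on one machine-written caption.
    at_least_as_good_as_human: bool
    passed_turing_test: bool


def score(judgments):
    """Return both measures as fractions between 0 and 1."""
    total = len(judgments)
    if total == 0:
        raise ValueError("need at least one judgment")
    # Booleans count as 0 or 1, so sum() gives the number of True verdicts.
    as_good = sum(j.at_least_as_good_as_human for j in judgments)
    turing = sum(j.passed_turing_test for j in judgments)
    return as_good / total, turing / total


# Made-up verdicts for four captions, purely for illustration.
sample = [
    Judgment(True, True),
    Judgment(False, True),
    Judgment(False, False),
    Judgment(True, True),
]
as_good_rate, turing_rate = score(sample)
```

In this made-up sample, two of the four captions are judged at least as good as a human's and three of the four pass as human-written.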

The competition is one of many for people working on image recognition systems. But this is the latest opportunity for Google to boast about its capabilities when it comes to analyzing both words and images at scale.

To perform so well in the competition, Google and Microsoft researchers employed a type of artificial intelligence called deep learning. It involves training systems called artificial neural networks on lots of data, like pictures, and then presenting them with a new piece of data so they can make an inference about it. Deep learning works behind the scenes for many consumer-facing web applications, including the new Google Photos service.
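The train-then-infer pattern can be shown at toy scale. The sketch below trains a single artificial neuron on four labeled points and then asks it about a point it has not seen. It is only an illustration of the idea: the function names and the data are invented here, and the systems entered in the challenge are deep networks with vastly more parameters and training data.

```python
import math


def sigmoid(z):
    # Squash any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=2000, lr=0.5):
    """Fit one neuron (two weights and a bias) by stochastic gradient descent."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), label in examples:
            pred = sigmoid(w[0] * x1 + w[1] * x2 + b)
            err = pred - label  # gradient of the log loss w.r.t. the pre-activation
            w[0] -= lr * err * x1
            w[1] -= lr * err * x2
            b -= lr * err
    return w, b


def infer(model, x):
    """Return the neuron's estimated probability that x belongs to class 1."""
    w, b = model
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)


# Toy training set (hypothetical): label is 1 only when both inputs are "on".
training_data = [
    ((0.0, 0.0), 0),
    ((0.0, 1.0), 0),
    ((1.0, 0.0), 0),
    ((1.0, 1.0), 1),
]
model = train(training_data)

# Inference on an input that was not in the training set.
new_point_score = infer(model, (0.9, 0.8))
```

Training adjusts the weights until the neuron's outputs match the labels; inference then reuses those fixed weights on new input.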

But Google and Microsoft are constantly improving their deep learning technology, as are several other companies, like Facebook and Baidu.

Impressing prospective hires is key at this point, with deep learning in vogue, so if nothing else, Google and Microsoft have succeeded in not looking like they lag behind other companies or academic teams.

To get a sense of what Microsoft’s cutting-edge image-captioning technology can do, check out this demo. It isn’t perfect — like Microsoft’s face-recognition technology — but it isn’t all that bad.
