Microsoft researchers are back at it, using a type of artificial intelligence called deep learning to build smarter software. Building on the company’s previous work, Microsoft Research employees have developed a new method of answering simple questions about the content of a photo more accurately than similar systems that have been demonstrated by other groups recently.
The approach involves two types of artificial neural networks: a convolutional neural network and a long short-term memory (LSTM) network. Both are trained on large quantities of data, such as photos, and then make inferences about new photos they are given. The new method goes further by incorporating a “stacked attention network,” which effectively homes in on the key region of an image in order to answer a specific question. The network has multiple attention layers, and each successive layer narrows the focus further, resulting in more accurate answers.
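To make the idea concrete, here is a minimal sketch of a stacked-attention pass in plain Python with NumPy. It is an illustration under assumptions, not Microsoft's implementation: the weight names (`W_v`, `W_u`, `w_p`), array shapes, and the two-layer loop are all invented for this example, with random region features standing in for CNN output and a single random vector standing in for the LSTM-encoded question.

```python
# Hedged sketch of stacked attention for image question answering.
# All names, shapes, and parameters here are illustrative assumptions.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_layer(v, u, W_v, W_u, w_p):
    """Score each image region against the query u, then blend regions."""
    h = np.tanh(v @ W_v + u @ W_u)   # (regions, hidden) joint representation
    p = softmax(h @ w_p)             # attention weights over image regions
    v_att = p @ v                    # weighted sum of region features
    return u + v_att, p              # refined query vector, attention weights

rng = np.random.default_rng(0)
regions, dim, hidden = 49, 512, 256           # e.g. a 7x7 CNN feature grid
v = rng.standard_normal((regions, dim))       # stand-in image region features
q = rng.standard_normal(dim)                  # stand-in question embedding

u = q
for layer in range(2):                        # two layers: coarse focus, then finer
    W_v = rng.standard_normal((dim, hidden)) * 0.01
    W_u = rng.standard_normal((dim, hidden)) * 0.01
    w_p = rng.standard_normal(hidden) * 0.01
    u, weights = attention_layer(v, u, W_v, W_u, w_p)
    print(f"layer {layer}: most attended region = {weights.argmax()}")

# The refined vector u would then feed a classifier over candidate one-word answers.
```

Each pass re-weights the image regions using the question-conditioned query, so later layers concentrate attention on a smaller part of the image, which is the "zooming in" effect described above.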
“It’s taking on a human’s attention capability,” Li Deng, partner research manager at Microsoft Research’s Deep Learning Technology Center, said in an interview for a post today on the Next at Microsoft blog. “This is the technology that couldn’t have been imagined a few years ago — modeling human behavior to solve problems.”
The new method, which was documented this month in the paper “Stacked Attention Networks for Image Question Answering,” outperforms academic work that Microsoft researchers published earlier this year in a paper called “VQA: Visual Question Answering.”
Microsoft isn't the only company investigating what can be done by mixing natural language processing with image recognition.
The automatic creation of captions for photos has been explored extensively by the likes of Google as well as Microsoft. In the narrower realm of image question answering, Baidu, Huawei, and others have published papers on their progress, while Facebook recently demonstrated a mobile app that lets blind people ask spoken questions about what's in a photo and hear spoken answers.
The big achievement here is accuracy. On questions that call for a one-word answer, the system outperforms what Baidu and Huawei have shown, and it isn't far behind human performance. In the big picture, that's a big deal.