Microsoft Research has revealed a system it developed to train machines to answer questions about images more like humans do.
"A picture, the saying goes, is worth a thousand words. When a person is asked about something in a photo, they're taking in a lot of details - a lot of words - to answer questions about it," said Microsoft in a blog post on the new technology. "Now, a team of Microsoft researchers, together with colleagues from Carnegie Mellon University, has created a system that uses computer vision, deep learning and language understanding to analyze images and answer questions the same way humans would." The system applies multi-step reasoning to answer questions about the images, taking in information like "a human set of eyes and brain would, looking at a scene's action (if there is any) and the relationships among multiple visual objects."
"We're using deep learning in different stages: to extract visual information, to represent the meaning of the question in natural language, and to focus the attention onto narrower regions of the image in two separate steps in order to seek the precise answer," said Li Deng, one of the model's creators. "It's taking on a human's attention capability. This is the technology that couldn't have been imagined a few years ago - modeling human behavior to solve problems." Deng developed the model with fellow researchers Xiaodong He and Jianfeng Gao from the Deep Learning Technology Center of Microsoft Research, along with research intern Zichao Yang and advisor Alex Smola from Carnegie Mellon University. This advancement builds on the team's previous research, which taught machines to automatically caption images.
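The two-step attention Deng describes — encoding the question, then repeatedly narrowing focus onto smaller image regions to find the answer — can be illustrated with a toy sketch. This is not the researchers' actual model; the region features, question vector, weight matrices, and dimensions below are all hypothetical placeholders, and the attention form is a generic soft-attention layer applied twice, assuming a question-refined query on the second pass.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over region scores.
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(regions, query, W_r, W_q, w):
    # Score each image region against the current query, then pool.
    h = np.tanh(regions @ W_r + query @ W_q)   # (num_regions, hidden)
    weights = softmax(h @ w)                   # one attention weight per region
    context = weights @ regions                # weighted sum of region features
    return context, weights

rng = np.random.default_rng(0)
num_regions, d, hidden = 9, 8, 16
regions = rng.normal(size=(num_regions, d))    # e.g. a 3x3 grid of visual features
question = rng.normal(size=d)                  # encoded question (placeholder)

W_r = rng.normal(size=(d, hidden))             # hypothetical learned weights
W_q = rng.normal(size=(d, hidden))
w = rng.normal(size=hidden)

# First pass: coarse focus on the image, guided by the question alone.
ctx1, a1 = attention_step(regions, question, W_r, W_q, w)
# Second pass: the question plus the first context narrows the focus further.
ctx2, a2 = attention_step(regions, question + ctx1, W_r, W_q, w)

answer_features = question + ctx2              # would feed an answer classifier
```

In a trained model the weight matrices would be learned end-to-end, and `a2` would concentrate on the region relevant to the answer; here the weights are random, so the sketch shows only the mechanics of iterating attention.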
Photo: © Microsoft.