Incorporating 3D Information Into Visual Question Answering

Yue Qiu, Yutaka Satoh, Ryota Suzuki, Hirokatsu Kataoka

Year: 2019
Citations: 4

Abstract

We propose a method for advancing the Visual Question Answering (VQA) task by incorporating 3D information from multi-view images. Conventional VQA approaches, which answer in words a natural-language question about a given RGB image, are less able to recognize geometric information, so they tend to fail at counting objects or inferring positional relationships. Moreover, they cannot reason about occluded space, so it is not feasible to deploy VQA on robots that will work in highly occluded real-world environments. To address this, we introduce a new multi-view VQA dataset along with an approach that incorporates 3D scene information, captured directly from multi-view images, into VQA without using depth images or employing SLAM. Our proposed approach achieves strong performance, with an overall accuracy of 95.4% on the challenging multi-view VQA dataset setup, which contains relatively severe occlusion. This work also demonstrates the promise of bridging the gap between 3D vision and language.

Keywords

Computer science, Question answering, Bridging (networking), Artificial intelligence, Task (project management), RGB color model, Robot, Function (biology), Computer vision, Natural language processing
