Depth-Aware Vision-and-Language Navigation using Scene Query Attention Network

Sinan Tan, Meng-Meng Ge, Di Guo, Huaping Liu, Fuchun Sun

Year
2022
Citations
4

Abstract

Vision-and-language navigation (VLN) is an important task in robotics and computer vision. However, most existing VLN models use only features extracted from RGB observations as input, even though real-world robots can also use depth sensors. Prior research has also shown that simply adding a depth stream to neural models yields only a marginal improvement on the VLN task. In this work, we therefore develop a novel VLN method that uses semantic map observations built from RGB-D input. We use vision pretraining, in which the model answers queries about the semantic information of specific regions of a scene, to efficiently encode the semantic map with a CNN and a scene query attention network. The proposed method works with a simple model and does not require large-scale vision-language transformer pretraining, and it increases the success rate by more than 10% over a baseline model. Combined with the Speaker-Follower training technique, it achieves a success rate of 58% on the R2R test set in the single-run setting, outperforming the previous RGB-D method and most existing RGB-only models that do not use large-scale vision-language transformer pretraining.
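
The abstract does not describe the architecture of the scene query attention network. As a purely illustrative sketch, and not the authors' implementation, the general idea of a query attending over encoded semantic-map cells can be shown with plain scaled dot-product attention. The function name `query_attention`, the 2-dimensional toy features, and the reading of `query` as an encoded question about a scene region are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def query_attention(query, cell_features):
    """Scaled dot-product attention of one query over map cells.

    query:         list of d floats (hypothetical encoding of a region query)
    cell_features: list of N cells, each a list of d floats
                   (hypothetical CNN features of semantic-map cells)
    Returns (weights, pooled): one attention weight per cell and the
    attention-weighted sum of the cell features.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, cell)) / math.sqrt(d)
        for cell in cell_features
    ]
    weights = softmax(scores)
    pooled = [
        sum(w * cell[i] for w, cell in zip(weights, cell_features))
        for i in range(d)
    ]
    return weights, pooled


# Toy example: three map cells with 2-dimensional features (assumed values).
cells = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
weights, pooled = query_attention(query, cells)
```

In this toy case the first and third cells match the query equally well and receive equal weights, while the second cell receives the smallest weight. A learned model would use trained projections for the query and cell features instead of raw vectors.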

Keywords

Computer science, Artificial intelligence, Transformer, RGB color model, Robot, Task (project management), Computer vision, Task analysis, Language model, ENCODE
