Depth-Aware Vision-and-Language Navigation using Scene Query Attention Network
Sinan Tan, Meng-Meng Ge, Di Guo, Huaping Liu, Fuchun Sun
- Year
- 2022
- Citations
- 4
Abstract
Vision-and-language navigation (VLN) has been an important task in the field of Robotics and Computer Vision. However, most existing vision-and-language navigation models only use features extracted from RGB observation as input, while robots can utilize depth sensors in the real world. Existing research has also shown that simply adding a depth stream to neural models could only provide a marginal improvement to the performance of the VLN task. Therefore, in our work, we develop a novel method for the VLN task using semantic map observations built from RGB-D input. We use vision-pretraining to efficiently encode the semantic map with CNN and scene query attention network by answering queries about semantic information of specific regions of a scene. The proposed method could be used with a simple model and does not require large-scale vision-language transformer pretraining, bringing a more than 10% increase in the success rate compared with a baseline model. When used together with the Speaker-Follower training technique, it achieves a success rate of 58 % on the test set for the R2R dataset in single-run setting, outperforming the previous RGB-D method and most existing RGB-only models that do not use large-scale vision-language transformers pretraining.
Keywords
Related papers
Statistical Learning Theory
Yuhai Wu, Vladimir Vapnik
1999
Artificial intelligence: a modern approach
1995
Applied Nonlinear Control
Jean-Jacques Slotine, Weiping Li
1991
A new optimizer using particle swarm theory
R.C. Eberhart, James Kennedy
2002