
Multiscale Feature Extraction and Fusion of Image and Text in VQA

Siyu Lu, Yueming Ding, Mingzhe Liu, Zhengtong Yin, Lirong Yin, Wenfeng Zheng

Publication year
2023
Citations
414
Access
Open access

Abstract

Visual Question Answering (VQA) is the task of finding the information in an image that is relevant to a question in order to answer that question correctly. It can be widely applied in visual assistance, automated security surveillance, and intelligent interaction between robots and humans. However, the accuracy of VQA has not been ideal. The main difficulty is that image features do not represent scene and object information well, and text information is not fully represented. This paper applied multi-scale feature extraction and fusion to both the image feature representation and the text representation parts of a VQA system to improve its accuracy. First, to address image feature representation, the feature outputs of different network layers were extracted with a pre-trained deep neural network, and the best feature fusion scheme was identified through experiments. Second, for sentence representation, a multi-scale feature method was introduced to characterize and fuse the word-level, phrase-level, and sentence-level features of sentences. Finally, the VQA model was improved using the multi-scale feature extraction and fusion method. The results show that adding multi-scale feature extraction and fusion improves the accuracy of the VQA model.
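
As a rough illustration of what fusing image features from different network layers can look like, the sketch below pools each layer's feature map to a vector and concatenates the results. This is only one possible fusion scheme, chosen here for simplicity; the abstract does not say which scheme the experiments favored. The function names `global_avg_pool` and `fuse_multiscale` and the toy feature values are hypothetical, not taken from the paper.

```python
from statistics import fmean


def global_avg_pool(feature_map):
    """Collapse a C x H x W feature map (nested lists) into a length-C vector."""
    return [fmean(v for row in channel for v in row) for channel in feature_map]


def fuse_multiscale(feature_maps):
    """Concatenate the pooled vectors of several layers into one descriptor.

    Concatenation is an assumed fusion choice for this sketch, not the
    scheme reported in the paper.
    """
    fused = []
    for fmap in feature_maps:
        fused.extend(global_avg_pool(fmap))
    return fused


# Toy stand-ins for the outputs of a shallow and a deep layer (made-up values).
shallow = [[[1.0, 3.0], [5.0, 7.0]]]  # 1 channel, 2 x 2 spatial grid
deep = [[[2.0]], [[4.0]]]             # 2 channels, 1 x 1 spatial grid

fused = fuse_multiscale([shallow, deep])
print(fused)  # [4.0, 2.0, 4.0]
```

Pooling before concatenation lets layers with different spatial resolutions contribute to a single fixed-length vector. The same pattern could in principle be applied to word-level, phrase-level, and sentence-level text features.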

Keywords

Computer science; Artificial intelligence; Feature extraction; Pattern recognition (psychology); Feature (linguistics); Fuse (electrical); Representation (politics); Image (mathematics); Data mining; Computer vision
