PERCEPTION

Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset

Ning Cheng, You Li, Jing Gao, Bin Fang, Jinan Xu, Wenjuan Han

Year: 2024
Access: Open access

Abstract

Tactility provides crucial support and enhancement for the perception and interaction capabilities of both humans and robots. Nevertheless, multimodal research related to touch focuses primarily on the visual and tactile modalities, with limited exploration of language. Beyond vocabulary, sentence-level descriptions contain richer semantics. Based on this, we construct a touch-language-vision dataset named TLV (Touch-Language-Vision) through human-machine cascade collaboration, featuring sentence-level descriptions for multimodal alignment. The new dataset is used to fine-tune our proposed lightweight training framework, STLV-Align (Synergistic Touch-Language-Vision Alignment), achieving effective semantic alignment with minimal parameter adjustments (1%). Project Page: https://xiaoen0.github.io/touch.page/.

Keywords

cs.CV, cs.RO

