Home /Research /OmniVTLA: Vision-Tactile-Language-Action Model with Semantic-Aligned Tactile Sensing

MANIPULATION

OmniVTLA: Vision-Tactile-Language-Action Model with Semantic-Aligned Tactile Sensing

Zhengxue Cheng, Yiqian Zhang, Wenkang Zhang, Haoyu Li, Keyu Wang, Li Song, Hengdi Zhang

Year: 2025
Access: Open access

Abstract

Recent vision-language-action (VLA) models build upon vision-language foundations, and have achieved promising results and exhibit the possibility of task generalization in robot manipulation. However, due to the heterogeneity of tactile sensors and the difficulty of acquiring tactile data, current VLA models significantly overlook the importance of tactile perception and fail in contact-rich tasks. To address this issue, this paper proposes OmniVTLA, a novel architecture involving tactile sensing. Specifically, our contributions are threefold. First, our OmniVTLA features a dual-path tactile encoder framework. This framework enhances tactile perception across diverse vision-based and force-based tactile sensors by using a pretrained vision transformer (ViT) and a semantically-aligned tactile ViT (SA-ViT). Second, we introduce ObjTac, a comprehensive force-based tactile dataset capturing textual, visual, and tactile information for 56 objects across 10 categories. With 135K tri-modal samples, ObjTac supplements existing visuo-tactile datasets. Third, leveraging this dataset, we train a semantically-aligned tactile encoder to learn a unified tactile representation, serving as a better initialization for OmniVTLA. Real-world experiments demonstrate substantial improvements over state-of-the-art VLA baselines, achieving 96.9% success rates with grippers, (21.9% higher over baseline) and 100% success rates with dexterous hands (6.2% higher over baseline) in pick-and-place tasks. Besides, OmniVTLA significantly reduces task completion time and generates smoother trajectories through tactile sensing compared to existing VLA. Our ObjTac dataset can be found at https://readerek.github.io/Objtac.github.io

Keywords

cs.RO

OmniVTLA: Vision-Tactile-Language-Action Model with Semantic-Aligned Tactile Sensing

Abstract

Keywords

Related papers

Real-Time Obstacle Avoidance for Manipulators and Mobile Robots

A Mathematical Introduction to Robotic Manipulation

Robot dynamics and control

A tutorial on visual servo control