Home /Research /ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers

MANIPULATION

ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers

Fotios Lygerakis, Ozan Özdenizci, Elmar Rückert

Year: 2025
Access: Open access

Abstract

Tactile sensing provides local essential information that is complementary to visual perception, such as texture, compliance, and force. Despite recent advances in visuotactile representation learning, challenges remain in fusing these modalities and generalizing across tasks and environments without heavy reliance on pre-trained vision-language models. Moreover, existing methods do not study positional encodings, thereby overlooking the multi-stage spatial reasoning needed to capture fine-grained visuotactile correlations. We introduce ViTaPEs, a transformer-based architecture for learning task-agnostic visuotactile representations from paired vision and tactile inputs. Our key idea is a two-stage positional injection: local (modality-specific) positional encodings are added within each stream, and a global positional encoding is added on the joint token sequence immediately before attention, providing a shared positional vocabulary at the stage where cross-modal interaction occurs. We make the positional injection points explicit and conduct controlled ablations that isolate their effect before a token-wise nonlinearity versus immediately before self-attention. Experiments on multiple large-scale real-world datasets show that ViTaPEs not only surpasses state-of-the-art baselines across various recognition tasks but also demonstrates zero-shot generalization to unseen, out-of-domain scenarios. We further demonstrate the transfer-learning strength of \emph{ViTaPEs} in a robotic grasping task, where it outperforms state-of-the-art baselines in predicting grasp success. Project page: https://sites.google.com/view/vitapes

Keywords

cs.CVcs.LGcs.RO

ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers

Abstract

Keywords

Related papers

Real-Time Obstacle Avoidance for Manipulators and Mobile Robots

A Mathematical Introduction to Robotic Manipulation

Robot dynamics and control

A tutorial on visual servo control