MANIPULATION

VLA-Grasp: a vision-language-action modeling with cross-modality fusion for task-oriented grasping

Jianwei Zhu, Xueying Sun, Qiang Zhang, Mingmin Liu

Year: 2025
Citations: 3
Access: Open access

Abstract

Task-oriented grasping (TOG) aims to predict an appropriate grasp pose for a specific task. While recent approaches have incorporated semantic knowledge into TOG models so that robots can understand linguistic commands, they do not jointly exploit the relevant information from vision, language, and action. To address this problem, we propose a novel multimodal fusion grasping framework called VLA-Grasp. VLA-Grasp uses a prompted large language model for task inference, together with multi-channel multimodal encoders and cross-attention modules that capture the intrinsic links among vision, language, and action, thereby improving the model's generalization ability. In addition, we introduce a multiple grasping decision method that provides multiple feasible grasping actions. We evaluate our approach on a publicly available dataset, compare it with state-of-the-art methods, and validate the model in a real-world scenario. The results show that our method provides a reliable and efficient solution for the TOG task. The code is available at https://github.com/Jianwei915/VLA-Grasp.
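
The abstract does not give architectural details, so the following is only a generic sketch of the two ideas it names: cross-attention between modalities and returning several feasible grasps. It assumes plain scaled dot-product attention and a simple top-k selection over scored candidates. The function names `cross_attention` and `top_k_grasps` are hypothetical and are not taken from the VLA-Grasp code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention (illustrative only).

    queries: tokens from one modality (e.g. language), each a list of d floats.
    keys, values: tokens from another modality (e.g. vision).
    Returns one fused vector per query token.
    """
    d = len(keys[0])
    dim_v = len(values[0])
    fused = []
    for q in queries:
        # Similarity of this query token to every key token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value tokens.
        fused.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)])
    return fused


def top_k_grasps(candidates, scores, k):
    """Assumed selection rule: keep the k highest-scoring candidates, best first."""
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in ranked[:k]]
```

In this sketch, language tokens acting as queries over visual tokens yield language-conditioned visual features, and `top_k_grasps` stands in for any procedure that returns more than one feasible grasp.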

Keywords

GRASP, Computational intelligence, Modality (human–computer interaction), Action (physics), Task (project management), Computer science, Artificial intelligence, Human–computer interaction, Computer vision, Natural language processing
