VLA-Grasp: a vision-language-action modeling with cross-modality fusion for task-oriented grasping
Jianwei Zhu, Xueying Sun, Qiang Zhang, Mingmin Liu
- Year: 2025
- Citations: 3
- Access: Open access
Abstract
Task-oriented grasping (TOG) aims to predict an appropriate grasp pose for a specific task. While recent approaches have incorporated semantic knowledge into TOG models so that robots can understand linguistic commands, they lack the ability to leverage relevant information across vision, language, and action. To address this problem, we propose a novel multimodal fusion grasping framework called VLA-Grasp. VLA-Grasp uses a prompted large language model for task inference, and introduces multi-channel multimodal encoders and cross-attention modules to capture the intrinsic links between vision, language, and action, thus improving the generalization ability of the model. We also introduce a multiple-grasp decision method that provides multiple feasible grasping actions. We evaluate our approach on a publicly available dataset, compare it with state-of-the-art methods, and further validate the model in a real-world scenario. The results show that our method provides a reliable and efficient solution for the TOG task. The code is available at https://github.com/Jianwei915/VLA-Grasp.
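The abstract names cross-attention as the mechanism that links modalities. As a rough illustration only, the sketch below shows generic scaled dot-product cross-attention in plain Python, where language tokens act as queries over visual tokens. The function names, the toy feature vectors, and the choice of which modality supplies the queries are all assumptions made for this example; they are not taken from the VLA-Grasp code, which may be organized quite differently.

```python
import math


def softmax(xs):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention (not the paper's exact module).

    Each query vector attends over all key vectors and returns a weighted
    average of the corresponding value vectors.
    """
    d = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused


# Hypothetical toy features: 2 language tokens, 3 visual tokens, dimension 2.
language_tokens = [[1.0, 0.0], [0.0, 1.0]]
visual_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Language queries attend over visual keys/values (an assumed direction).
fused = cross_attention(language_tokens, visual_tokens, visual_tokens)
```

Each row of `fused` is a convex combination of the visual tokens, weighted toward the visual features most similar to the corresponding language token. A learned model would additionally apply trainable projections to queries, keys, and values, which this sketch omits.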