MANIPULATION

VLA-Grasp: a vision-language-action modeling with cross-modality fusion for task-oriented grasping

Jianwei Zhu, Xueying Sun, Qiang Zhang, Mingmin Liu

Publication year: 2025
Citations: 3
Access: Open access

Abstract

Task-oriented grasping (TOG) aims to predict a grasp pose appropriate to a specific task. Recent approaches incorporate semantic knowledge into TOG models so that robots can understand linguistic commands, but they lack the ability to leverage relevant information across vision, language, and action. To address this problem, we propose VLA-Grasp, a novel multimodal fusion grasping framework. VLA-Grasp uses a prompted large language model for task inference, and it introduces multi-channel multimodal encoders and cross-attention modules that capture the intrinsic links among vision, language, and action, thereby improving the model's generalization ability. We also introduce a multiple-grasp decision method that provides several feasible grasping actions. We evaluate our approach on a publicly available dataset, compare it with state-of-the-art methods, and validate it in a real-world scenario. The results show that our method provides a reliable and efficient solution for the TOG task. The code is available at https://github.com/Jianwei915/VLA-Grasp.
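
As a rough illustration of the two ideas named in the abstract, cross-attention fusion and returning several feasible grasps, the sketch below lets candidate grasp features attend over concatenated vision and language tokens and then keeps the top-k scored candidates. It is not the authors' implementation: the function names (`cross_attention`, `top_k_grasps`), the feature dimensions, the random weights, and the linear scoring head are all assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    # Single-head scaled dot-product cross-attention:
    # each query row attends over all context rows.
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


def top_k_grasps(scores, k):
    # Indices of the k highest-scoring candidates, best first.
    order = np.argsort(scores)[::-1]
    return order[:k]


# Hypothetical shapes: 5 candidate grasps (action tokens),
# 8 vision tokens and 4 language tokens, all of width d.
rng = np.random.default_rng(0)
d = 16
grasp_feats = rng.normal(size=(5, d))
vision_tokens = rng.normal(size=(8, d))
language_tokens = rng.normal(size=(4, d))
context = np.concatenate([vision_tokens, language_tokens], axis=0)

# Random projections stand in for learned parameters.
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
fused, weights = cross_attention(grasp_feats, context, w_q, w_k, w_v)

# A linear head (assumed) maps each fused grasp feature to a scalar score.
w_out = rng.normal(size=(d,))
scores = fused @ w_out
best = top_k_grasps(scores, k=3)
```

Here `best` holds the indices of three candidate grasps in descending score order, which is one simple way to offer several feasible actions instead of a single prediction.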

Keywords

GRASP, Computational intelligence, Modality (human–computer interaction), Action (physics), Task (project management), Computer science, Artificial intelligence, Human–computer interaction, Computer vision, Natural language processing
