MANIPULATION

GPTArm: An Autonomous Task Planning Manipulator Grasping System Based on Vision–Language Models

Jiaqi Zhang, Zinan Wang, Jiaxin Lai, Hongfei Wang

Publication year
2025
Citations
4
Access
Open access

Abstract

The integration of vision–language models (VLMs) with robotic systems is a major advance for autonomous task planning and execution. However, traditional robotic arms that rely on pre-programmed instructions adapt poorly to dynamic environments and suffer from a semantic gap between perception and execution, which limits their ability to handle complex tasks. This paper introduces GPTArm, an environment-aware robotic arm system driven by GPT-4V that addresses these challenges through hierarchical task decomposition, closed-loop error recovery, and multimodal interaction. Its robotic task processing framework (RTPF) integrates real-time visual perception, contextual reasoning, and autonomous strategy planning, enabling the arm to interpret natural language commands, decompose user-defined tasks into executable subtasks, and dynamically recover from errors. Experiments across ten manipulation tasks show a success rate of up to 91.4% on standardized benchmarks and robust generalization to unseen objects. By combining GPT-4V's reasoning with YOLOv10's precise small-object localization, the system outperforms existing methods in accuracy and adaptability. GPTArm also supports flexible natural language interaction via voice and text, significantly improving the user experience in human–robot collaboration.
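The abstract describes RTPF only at a high level. To illustrate the general decompose, execute, verify, and retry pattern it names, here is a minimal Python sketch under stated assumptions; it is not GPTArm's implementation. `Subtask`, `ExecutionLog`, `run_task`, `toy_plan`, and `flaky_execute` are hypothetical names, the GPT-4V planner and YOLOv10-based checks are replaced by plain callables, and error recovery is reduced to a bounded retry per subtask because the abstract does not specify the actual recovery mechanism.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Subtask:
    # Hypothetical subtask record: one primitive action on one target object.
    action: str
    target: str


@dataclass
class ExecutionLog:
    completed: List[Subtask] = field(default_factory=list)
    retries: int = 0


def run_task(
    command: str,
    plan: Callable[[str], List[Subtask]],   # hypothetical stand-in for a VLM planner
    execute: Callable[[Subtask], bool],     # hypothetical arm controller; True on success
    verify: Callable[[Subtask], bool],      # hypothetical perception check after each step
    max_retries: int = 2,
) -> ExecutionLog:
    """Decompose a command into subtasks, then run each one in a closed loop."""
    log = ExecutionLog()
    for sub in plan(command):
        attempts = 0
        # Closed loop: act, then check the result before moving on.
        while not (execute(sub) and verify(sub)):
            attempts += 1
            log.retries += 1
            if attempts > max_retries:
                raise RuntimeError(f"giving up on {sub.action} {sub.target}")
        log.completed.append(sub)
    return log


# Usage with stubs (all hypothetical).
def toy_plan(command: str) -> List[Subtask]:
    # Fixed decomposition; a real system would query a VLM with the command and an image.
    return [Subtask("locate", "apple"), Subtask("grasp", "apple"), Subtask("place", "bowl")]


failures = {("grasp", "apple"): 1}  # make the first grasp attempt fail once


def flaky_execute(sub: Subtask) -> bool:
    key = (sub.action, sub.target)
    if failures.get(key, 0) > 0:
        failures[key] -= 1
        return False
    return True


log = run_task("put the apple in the bowl", toy_plan, flaky_execute, lambda s: True)
print([f"{s.action}:{s.target}" for s in log.completed], log.retries)
# ['locate:apple', 'grasp:apple', 'place:bowl'] 1
```

Injecting the planner, executor, and verifier as callables lets the same loop run against stubs; a fuller sketch might re-query the planner after repeated failures instead of raising.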

Keywords

Task (project management), Manipulator (device), Computer science, Artificial intelligence, Computer vision, Motion planning, Human–computer interaction, Robot, Engineering, Systems engineering
