MANIPULATION

GPTArm: An Autonomous Task Planning Manipulator Grasping System Based on Vision–Language Models

Jiaqi Zhang, Zinan Wang, Jiaxin Lai, Hongfei Wang

Year: 2025
Citations: 4
Access: Open access

Abstract

The integration of vision–language models (VLMs) with robotic systems represents a transformative advancement in autonomous task planning and execution. However, traditional robotic arms relying on pre-programmed instructions exhibit limited adaptability in dynamic environments and face semantic gaps between perception and execution, hindering their ability to handle complex task demands. This paper introduces GPTArm, an environment-aware robotic arm system driven by GPT-4V, designed to overcome these challenges through hierarchical task decomposition, closed-loop error recovery, and multimodal interaction. The proposed robotic task processing framework (RTPF) integrates real-time visual perception, contextual reasoning, and autonomous strategy planning, enabling robotic arms to interpret natural language commands, decompose user-defined tasks into executable subtasks, and dynamically recover from errors. Experimental evaluations across ten manipulation tasks demonstrate GPTArm’s superior performance, achieving a success rate of up to 91.4% in standardized benchmarks and robust generalization to unseen objects. Leveraging GPT-4V’s reasoning and YOLOv10’s precise small-object localization, the system surpasses existing methods in accuracy and adaptability. Furthermore, GPTArm supports flexible natural language interaction via voice and text, significantly enhancing user experience in human–robot collaboration.
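The abstract names hierarchical task decomposition and closed-loop error recovery but gives no implementation detail, so the following is only a minimal Python sketch of that control pattern under stated assumptions. The names `Subtask`, `RunLog`, `plan`, and `run` are hypothetical and do not come from the paper; the planner is a fixed stand-in for the GPT-4V reasoning step, and the `execute` callback stands in for perception and arm control.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Subtask:
    action: str  # hypothetical action label, e.g. "locate", "grasp", "place"
    target: str  # object name taken from the user command


@dataclass
class RunLog:
    completed: List[Subtask] = field(default_factory=list)
    retries: int = 0
    success: bool = True


def plan(command: str) -> List[Subtask]:
    """Stand-in planner (assumption): treats the last word of the command as
    the target and emits a fixed locate/grasp/place sequence. A real system
    would query a vision-language model here."""
    target = command.rsplit(" ", 1)[-1]
    return [Subtask(a, target) for a in ("locate", "grasp", "place")]


def run(command: str,
        execute: Callable[[Subtask], bool],
        max_retries: int = 2) -> RunLog:
    """Run each subtask in order; on failure, retry that step up to
    max_retries times before aborting the whole task."""
    log = RunLog()
    for subtask in plan(command):
        attempts = 0
        while not execute(subtask):
            attempts += 1
            if attempts > max_retries:
                log.success = False
                return log
            log.retries += 1
        log.completed.append(subtask)
    return log
```

In this sketch the feedback signal is just the boolean returned by `execute`; a system like the one described would presumably derive it from visual re-observation of the scene, and could replan instead of simply repeating the failed step.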

Keywords

Task (project management); Manipulator (device); Computer science; Artificial intelligence; Computer vision; Motion planning; Human–computer interaction; Robot; Engineering; Systems engineering
