MANIPULATION

V2P-Manip: Learning Dexterous Manipulation from Monocular Human Videos

Kaihan Chen, Yanming Shao, Haifeng Ji, Xiaokang Yang, Yao Mu

Year: 2026
Access: Open access

Abstract

Achieving autonomous robotic dexterous manipulation requires precise, human-like action sequences at scale. As a scalable supplement to costly teleoperation data, extracting trajectories with both visual fidelity and physical plausibility from monocular videos represents a promising frontier in embodied AI. To this end, we introduce V2P-Manip, a framework that learns dexterous manipulation policies directly from human demonstration videos through an efficient, integrated pipeline encompassing 3D asset acquisition, trajectory estimation, and dexterous policy learning. To bridge the gap between visual perception and physical constraints, a two-stage refinement process enforces spatial alignment and physical consistency. Evaluations on the TACO and OakInk benchmarks demonstrate that our approach significantly outperforms previous methods in pose accuracy, adaptability to unstructured environments, and training efficiency. Experimental results further confirm an average success rate of over 75% across multiple synthetic manipulation tasks and validate the adaptability of the extracted manipulation priors across diverse dexterous hand embodiments.

Keywords

dexterous manipulation, monocular video, policy learning, 3D trajectory estimation, embodied AI
