CAGE: Causal Attention Enables Data-Efficient Generalizable Robotic Manipulation
Shangning Xia, Hongjie Fang, Cewu Lu, Haoshu Fang
- Year: 2025
- Citations: 3
Abstract
Generalization in robotic manipulation remains a critical challenge, particularly when scaling to new environments with limited demonstrations. This paper introduces CAGE, a novel robotic manipulation policy designed to overcome these generalization barriers by integrating a pretrained visual representation with a causal attention mechanism. CAGE utilizes the powerful feature extraction capabilities of the vision foundation model DINOv2, combined with LoRA fine-tuning for robust environment understanding. The policy further employs a causal perceiver for effective token compression and a diffusion-based action head with attention to enhance task-specific fine-grained conditioning. With as few as 50 demonstrations from a single training environment, CAGE achieves robust generalization across diverse visual changes in objects, backgrounds, and viewpoints. Extensive experiments validate that CAGE significantly outperforms existing state-of-the-art RGB/RGB-D-based approaches in various manipulation tasks, especially under large distribution shifts. In similar environments, CAGE offers an average 42% increase in task completion rate. While all baselines fail in unseen environments, CAGE obtains a 43% completion rate and a 51% success rate on average, marking a substantial advancement toward the practical deployment of robots in real-world settings. Project website: cage-policy.github.io.
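As a generic illustration of the causal attention named in the title, the sketch below shows single-head scaled dot-product attention in which each position attends only to itself and earlier positions. It is not the CAGE implementation: the abstract does not specify the architecture at this level, and the function name `causal_attention` and the list-of-vectors input format are assumptions made for this example.

```python
import math


def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v: lists of T vectors (lists of floats); q and k share dimension d.
    Position t may attend only to positions 0..t.
    Hypothetical helper for illustration, not taken from the CAGE code.
    """
    d = len(q[0])
    out = []
    for t, q_t in enumerate(q):
        # Score the query against the current and earlier keys only;
        # skipping later keys is equivalent to applying a causal mask.
        scores = [
            sum(a * b for a, b in zip(q_t, k[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        weights = [math.exp(x - m) for x in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the visible value vectors.
        out.append(
            [sum(w * v[s][j] for s, w in enumerate(weights))
             for j in range(len(v[0]))]
        )
    return out
```

Under this mask, the output at the first position depends only on the first value vector, so changing later inputs leaves earlier outputs unchanged.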