首页 /研究 /MinD: Learning A Dual-System World Model for Real-Time Planning and Implicit Risk Analysis
MANIPULATION

MinD: Learning A Dual-System World Model for Real-Time Planning and Implicit Risk Analysis

Xiaowei Chi, Kuangzhi Ge, Jiaming Liu, Siyuan Zhou, Peidong Jia, Zichen He, Yuzhen Liu, Tingguang Li, Lei Han, Sirui Han, Shanghang Zhang, Yike Guo

发表年份
2025
访问权限
开放获取

摘要

Video Generation Models (VGMs) have become powerful backbones for Vision-Language-Action (VLA) models, leveraging large-scale pretraining for robust dynamics modeling. However, current methods underutilize their distribution modeling capabilities for predicting future states. Two challenges hinder progress: integrating generative processes into feature learning is both technically and conceptually underdeveloped, and naive frame-by-frame video diffusion is computationally inefficient for real-time robotics. To address these, we propose Manipulate in Dream (MinD), a dual-system world model for real-time, risk-aware planning. MinD uses two asynchronous diffusion processes: a low-frequency visual generator (LoDiff) that predicts future scenes and a high-frequency diffusion policy (HiDiff) that outputs actions. Our key insight is that robotic policies do not require fully denoised frames but can rely on low-resolution latents generated in a single denoising step. To connect early predictions to actions, we introduce DiffMatcher, a video-action alignment module with a novel co-training strategy that synchronizes the two diffusion models. MinD achieves a 63% success rate on RL-Bench, 60% on real-world Franka tasks, and operates at 11.3 FPS, demonstrating the efficiency of single-step latent features for control signals. Furthermore, MinD identifies 74% of potential task failures in advance, providing real-time safety signals for monitoring and intervention. This work establishes a new paradigm for efficient and reliable robotic manipulation using generative world models.

关键词

cs.ROcs.AI

相关论文

查看 MANIPULATION 分类全部论文