A Survey of Unified Multimodal Understanding and Generation: Advances and Challenges
Yan Yang, Yang Shi, Wulin Xie, Yifan Zhang, Yuhao Dong, Yibo Hu, Liang Wang, Ran He, Caifeng Shan, Chaoyou Fu, Tieniu Tan
- Publication year
- 2025
- Citation count
- 3
- Access
- Open access
Abstract
Advancing AGI requires AI that can jointly understand and generate across modalities: text, images, video, and audio. As illustrated in Fig. 1, this unification evolves through three stages: from isolated expertise with separate models, to integrated capabilities in a unified framework, to emergent behaviors, a future vision that enables complex interleaved reasoning. Two factors motivate this unification: (1) mutual reinforcement, where strong comprehension enables high-quality creation and generation aids difficult reasoning via feedback loops; and (2) the flexibility to tackle complex real-world problems, such as turning a script into a coherent movie, which isolated models cannot handle. Although open-source efforts (e.g., BAGEL, Emu3) are promising, open-source unified foundation models (UFMs) still lag behind powerful closed-source models (e.g., GPT-4o, Gemini 2.0 Flash). The open-source community lacks consensus on key design choices for UFMs, such as modeling paradigms (autoregressive vs. hybrid diffusion), tokenizers, training strategies, and data curation, and this hampers progress. To bridge this gap, we systematically review over 700 papers to identify challenges, reveal promising directions, and accelerate development. We propose a taxonomy of UFM architectures based on coupling degree: external service-integrated, modular joint, and end-to-end unified modeling. We analyze encoding and decoding strategies across representations (continuous vs. discrete) and modalities (image, video, audio); discuss the training lifecycle (pre-training, instruction fine-tuning, alignment) together with benchmarks for understanding, generation, and mixed tasks; and review applications in robotics, autonomous driving, medicine, and vision, highlighting their strengths and weaknesses.
Based on our analysis, we summarize trends and discuss shortcomings, such as pure autoregressive or pure diffusion paradigms underperforming hybrid paradigms (which require extra objectives), dual-branch tokenizers introducing redundancy, and gaps in RL reward models, algorithms, and benchmarks. In conclusion, this work aims to serve as a foundation that guides and inspires further research in building more advanced UFMs and more unified and capable multimodal AI systems.