A Survey of Unified Multimodal Understanding and Generation: Advances and Challenges

Yan Yang, Yang Shi, Wulin Xie, Yifan Zhang, Yuhao Dong, Yibo Hu, Liang Wang, Ran He, Caifeng Shan, Chaoyou Fu, Tieniu Tan

Year: 2025
Citations: 3
Access: Open access

Abstract

Advancing AGI requires AI that can jointly understand and generate across modalities: text, images, video, and audio. As illustrated in Fig. 1, this unification evolves through three stages: from isolated expertise with separate models, to integrated capabilities in a unified framework, and finally to emergent behaviors, a future vision that enables complex interleaved reasoning. Unification is motivated by two factors: (1) mutual reinforcement, where strong comprehension enables high-quality creation and generation aids difficult reasoning via feedback loops; and (2) the flexibility to tackle complex real-world problems, such as turning a script into a coherent movie, which isolated models cannot handle. Despite promising open-source efforts (e.g., BAGEL, Emu3), open-source unified foundation models (UFMs) still lag behind powerful closed-source models (e.g., GPT-4o, Gemini 2.0 Flash). The open-source community lacks consensus on key design choices for UFMs, such as modeling paradigms (autoregressive vs. hybrid diffusion), tokenizers, training strategies, and data curation, which hampers progress. To bridge this gap, we systematically review over 700 papers to identify challenges, reveal promising directions, and accelerate development. We propose a taxonomy of UFM architectures based on coupling degree: external service-integrated, modular joint, and end-to-end unified modeling. We analyze encoding and decoding strategies across representations (continuous vs. discrete) and modalities (image, video, audio); discuss the training lifecycle (pre-training, instruction fine-tuning, alignment) together with benchmarks for understanding, generation, and mixed tasks; and review applications in robotics, autonomous driving, medicine, and vision, highlighting their strengths and weaknesses.
Based on our analysis, we summarize trends and discuss open problems, such as purely autoregressive or purely diffusion-based models underperforming hybrid paradigms (which in turn require extra objectives), dual-branch tokenizers introducing redundancy, and gaps in RL reward models, algorithms, and benchmarks. In conclusion, this work aims to provide a foundation that guides and inspires further research in building more unified and capable multimodal AI systems.

Keywords

Unification; Key (lock); Feature (linguistics); Term (time)
