Multimodal learning with next-token prediction for large multimodal models

Xinlong Wang, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Zhen Li, Yuyi Wang, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Boya Wu, Bo Zhao, Bowen Zhang, Liangdong Wang, Guang Liu, Zheqi He

Year: 2026
Citations: 3

Abstract

Developing a unified algorithm that can learn from and generate across modalities such as text, images and video has been a fundamental challenge in artificial intelligence. Although next-token prediction has driven major advances in large language models [1], its extension to multimodal domains has remained limited, and diffusion models for image and video synthesis [2,3] and compositional frameworks that integrate vision encoders with language models [4] still dominate. Here we introduce Emu3, a family of multimodal models trained solely with next-token prediction. Emu3 equals the performance of well-established task-specific models across both perception and generation, matching flagship systems while removing the need for diffusion or compositional architectures. It further demonstrates coherent, high-fidelity video generation, interleaved vision–language generation and vision–language–action modelling for robotic manipulation. By reducing multimodal learning to unified token prediction, Emu3 establishes a robust foundation for large-scale multimodal modelling and offers a promising route towards unified multimodal intelligence.

Emu3 enables large-scale text, image and video learning based solely on next-token prediction, matching the generation and perception performance of task-specific methods, with implications for the development of scalable and unified multimodal intelligence systems.

Keywords

Modalities, Multimodal learning, Encoder, Matching (statistics), Security token, Extension (predicate logic)
