Multimodal Deep Learning: A Survey of Models, Fusion Strategies, Applications, and Research Challenges
Sai Teja Erukude, Suhasnadh Reddy Veluru, Viswa Chaitanya Marella
- Year: 2025
- Citations: 2
- Access: Open access
Abstract
Multimodal deep learning has become a primary methodological framework in artificial intelligence, allowing models to learn from and reason over many different types of data, such as text, images, audio, and video. By using multiple modalities simultaneously, systems can improve their contextual understanding, robustness to noise, and generalization, in ways that more closely resemble human perception. This review offers a comprehensive overview of the field, covering the basics of modality integration, fusion methods (early, late, and hybrid), and the main architectural advances in models such as CLIP, Flamingo, GPT-4V, Gemini 1.5, and AudioCLIP. It also introduces real-world applications in healthcare, autonomous systems, robotics, and education, along with the benchmark datasets and evaluation metrics needed to assess performance. Notable challenges, such as modality imbalance, scalability, and interoperability, are highlighted, together with growing areas of interest such as long-context modeling and embodied intelligence. The survey aims to provide a map of options for researchers and practitioners who want to make better use of multimodal AI systems, both in research and in deployment.