PERCEPTION

Multimodal Deep Learning: A Survey of Models, Fusion Strategies, Applications, and Research Challenges

Sai Teja Erukude, Suhasnadh Reddy Veluru, Viswa Chaitanya Marella

Publication year
2025
Citations
2
Access
Open access

Abstract

Multimodal deep learning has become a primary methodological framework in artificial intelligence, allowing models to learn from and reason over many different types of data, such as text, images, audio, and video. By using multiple modalities simultaneously, systems can improve their contextual understanding, noise resilience, and generalization, bringing them closer to human perception. This review offers a comprehensive overview of the field, covering the fundamentals of modality integration, fusion methods (early, late, and hybrid), and the main architectural advances in models such as CLIP, Flamingo, GPT-4V, Gemini 1.5, and AudioCLIP. It also provides a primer on real-world applications in healthcare, autonomous systems, robotics, and education, along with the benchmark datasets and evaluation metrics used to assess performance. Notable challenges, such as modality imbalance, scalability, and interoperability, are highlighted, together with growing areas of interest such as long-context modeling and embodied intelligence. The goal of this survey is to provide a map of options for researchers and practitioners who want to improve their use of multimodal AI systems, both in research and in deployment.

Keywords

Computer science, Deep learning, Artificial intelligence, Fusion, Data science, Machine learning, Human–computer interaction
