首页 /研究 /ImageChain: Advancing Sequential Image-to-Text Reasoning in Multimodal Large Language Models

OTHER

ImageChain: Advancing Sequential Image-to-Text Reasoning in Multimodal Large Language Models

Danae Sánchez Villegas, Ingo Ziegler, Desmond Elliott

发表年份: 2025
访问权限: 开放获取

摘要

Reasoning over sequences of images remains a challenge for multimodal large language models (MLLMs). While recent models incorporate multi-image data during pre-training, they still struggle to recognize sequential structures, often treating images independently. This work introduces ImageChain, a framework that enhances MLLMs with sequential reasoning capabilities over image data by modeling visual sequences as a multi-turn conversation. In ImageChain, images are interleaved with corresponding textual descriptions to form a controlled dialogue that explicitly captures temporal dependencies and narrative progression. Our method optimizes for the task of next-scene description, where the model generates a context-aware description of an upcoming scene based on preceding visual and textual cues. We demonstrate that our approach improves performance on the next-scene description task -- achieving an average improvement from 3.7% to 19% in SimRate, a metric that quantifies semantic similarity to human-annotated ground truths. Moreover, ImageChain achieves robust zero-shot out-of-domain performance in applications ranging from comics to robotics. Extensive experiments validate that instruction-tuning in a multimodal, multi-turn conversation design is key to bridging the gap between static image understanding and temporally-aware reasoning.

关键词

cs.CVcs.CLcs.LG

ImageChain: Advancing Sequential Image-to-Text Reasoning in Multimodal Large Language Models

摘要

关键词

相关论文

一种面向线弧增材制造的电动汽车结构可制造性拓扑优化的双环框架

几何数字孪生：一种用于航空发动机装配精度预测的数字智能模型

新型大口径偏置馈电可展开天线设计与动态性能预测

通过人工智能驱动的机器人技术革新产业