Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU
Zhenyu Ning, Jieru Zhao, Qihao Jin, Wenchao Ding, Minyi Guo
- 发表年份
- 2024
- 访问权限
- 开放获取
摘要
Multimodal Large Language Models (MLLMs) are distinguished by their multimodal comprehensive ability and widely used in many real-world applications including GPT-4o, autonomous driving and robotics. Despite their impressive performance, the multimodal inputs always incur long context. The inference under long context requires caching massive Key and Value states (KV cache) of previous tokens, which introduces high latency and excessive memory consumption. Due to this reason, it is challenging to deploy streaming inference of MLLMs on edge devices, which largely constrains the power and usage of MLLMs in real-world applications. In this paper, we introduce Inf-MLLM, an efficient inference framework for MLLMs, which enable streaming inference of MLLM on a single GPU with infinite context. Inf-MLLM is based on our key observation of the attention pattern in both LLMs and MLLMs called "attention saddles". Thanks to the newly discovered attention pattern, Inf-MLLM maintains a size-constrained KV cache by dynamically caching recent tokens and relevant tokens. Furthermore, Inf-MLLM proposes attention bias, a novel approach to enable MLLMs to capture long-term dependency. We show that Inf-MLLM enables multiple LLMs and MLLMs to achieve stable performance over 4M-token long texts and multi-round conversations with 1-hour-long videos on a single GPU. In addition, Inf-MLLM exhibits superior streaming reasoning quality than existing methods such as StreamingLLM and 2x speedup than H2O.
关键词
相关论文
一种面向线弧增材制造的电动汽车结构可制造性拓扑优化的双环框架
Qiang Cui, Chuan Yu, Daoqian Yang 等 5 位作者
Robotics and Computer-Integrated Manufacturing · 2026
几何数字孪生:一种用于航空发动机装配精度预测的数字智能模型
Ke Shang, Xin Jin, Teli Xu 等 7 位作者
Robotics and Computer-Integrated Manufacturing · 2026
通过人工智能驱动的机器人技术革新产业
Aryan Chaudhary
Recent Advances in Computer Science and Communications · 2026
新型大口径偏置馈电可展开天线设计与动态性能预测
Chuang Shi, Tianming Liu, Ning Xue 等 9 位作者
Aerospace Science and Technology · 2026