SURGICAL

EndoChat: Grounded multimodal large language model for endoscopic surgery

Guankun Wang, Long Bai, Junyi Wang, Kun Yuan, Zhen Li, Tianxu Jiang, Xin‐Tao He, Jinlin Wu, Zhen Chen, Zhen Lei, Hongbin Liu, Jiazheng Wang, Fan Zhang, Nicolas Padoy, Nassir Navab, Hongliang Ren

Publication year: 2025
Citations: 9

Abstract

Recently, Multimodal Large Language Models (MLLMs) have demonstrated their immense potential in computer-aided diagnosis and decision-making. In the context of robotic-assisted surgery, MLLMs can serve as effective tools for surgical training and guidance. However, there is still a deficiency of MLLMs specialized for surgical scene understanding in endoscopic procedures. To this end, we present EndoChat MLLM to address various dialogue paradigms and subtasks in understanding endoscopic procedures. To train our EndoChat, we construct the Surg-396K dataset through a novel pipeline that systematically extracts surgical information and generates structured annotations based on large-scale endoscopic surgery datasets. Furthermore, we introduce a multi-scale visual token interaction mechanism and a visual contrast-based reasoning mechanism to enhance the model’s representation learning and reasoning capabilities. Our model achieves state-of-the-art performance across five dialogue paradigms and seven surgical scene understanding tasks. Additionally, we conduct evaluations with professional surgeons, who provide positive feedback on the majority of conversation cases generated by EndoChat. Overall, these results demonstrate that EndoChat has the potential to advance training and automation in robotic-assisted surgery. Our dataset and model are publicly available at https://github.com/gkw0010/EndoChat.

• Propose EndoChat for visual grounding conversations in endoscopic surgery.
• Develop Surg-396K, a multimodal surgical dataset with 396K image-instruction pairs.
• Integrate Mixed Visual Token Engine for enhanced visual feature extraction.
• Achieve superior performance in surgical scene understanding dialogues and tasks.
• Positive feedback from surgeons indicates potential for training support.

Keywords

Computer science, Artificial intelligence, Linguistics, Natural language processing, Philosophy
