PERCEPTION

ALOHA: Adapting Local Spatio-Temporal Context to Enhance the Audio-Visual Semantic Segmentation

Yanghao Zhou, Heyan Huang, Cunhan Guo, Rong-Cheng Tu, Zeyu Xiao, Bo Wang, Xian-Ling Mao

Publication year
2025
Citations
2

Abstract

Audio-Visual Semantic Segmentation (AVSS) plays a crucial role in pixel-level multi-modal perception for real-world applications such as robotic navigation and autonomous driving. Existing methods typically rely on global spatio-temporal modules to fuse audio and visual representations, which aids in generating pixel-level semantic masks. However, these approaches often overlook the importance of local spatio-temporal context in understanding semantics, leading to suboptimal performance. This limitation makes it difficult for models to accurately distinguish sound-emitting objects from irrelevant background noise, resulting in erroneous segmentation across the spatio-temporal dimension. To address this issue, we propose the ALOHA framework, which Adapts LOcal spatio-temporal context to enHAnce AVSS. The framework introduces two key components designed to leverage and enhance local spatio-temporal context information: the LOHA adapter and the Selective Context Enhancement (SCE) module. Specifically, the LOHA adapter adaptively captures essential modality information across spatio-temporal dimensions, while implicitly learning fine-grained local context through the local attention mechanism. Furthermore, the SCE module selectively enhances the local context related to the semantics, thereby facilitating the distinction between the sounding object and irrelevant background and improving segmentation accuracy. Moreover, to better adapt to embodied AI systems, our framework utilizes a parameter-shared encoder and applies the adapters in a staged manner. This design significantly reduces the number of trainable parameters, making it more parameter-efficient. Experimental results demonstrate that the proposed framework achieves state-of-the-art performance on the AVSBench-Semantic benchmark dataset and shows competitive results on the AVSBench-Object benchmark, while exhibiting broad adaptability across different visual backbone networks.
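
The abstract does not specify how the local attention mechanism is built. As a rough illustration of the general idea only, the sketch below implements a generic windowed self-attention over a short sequence of frame features, where each frame attends only to neighbours within a fixed radius. The function name `local_attention`, the `radius` parameter, and the use of raw features as queries, keys, and values are all assumptions made for this example; this is not the LOHA adapter or the SCE module.

```python
import math


def local_attention(frames, radius=1):
    """Generic windowed self-attention over a list of feature vectors.

    Illustrative only: a hypothetical stand-in for "local attention",
    not the method from the paper. Each position i attends to positions
    j with |i - j| <= radius; all other weights are exactly zero.
    """
    t = len(frames)
    d = len(frames[0])
    outputs, all_weights = [], []
    for i in range(t):
        lo, hi = max(0, i - radius), min(t, i + radius + 1)
        # Scaled dot-product scores against the local window only.
        scores = [
            sum(a * b for a, b in zip(frames[i], frames[j])) / math.sqrt(d)
            for j in range(lo, hi)
        ]
        # Softmax over the window, shifted by the max for stability.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [0.0] * t
        for j, e in zip(range(lo, hi), exps):
            weights[j] = e / z
        # Weighted sum of the (local) value vectors.
        outputs.append(
            [sum(weights[j] * frames[j][k] for j in range(t)) for k in range(d)]
        )
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical toy input: four frames with two-dimensional features.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
out, w = local_attention(frames, radius=1)
```

With `radius=1`, the first frame places no weight on the third or fourth frame, which is the sense in which the context is local rather than global. A real model would add learned projections and operate over spatial as well as temporal neighbourhoods.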

Keywords

Computer science, Segmentation, Audio visual, Context (archaeology), Aloha, Artificial intelligence, Multimedia, Human–computer interaction, Computer vision, Telecommunications
