HRI

MuVAP: Multimodal Multiparty Voice Activity Projection for Turn-taking Prediction in the Wild

Haotian Qi, Gabriel Skantze

Year: 2026
Access: Open access

Abstract

Current multiparty turn-taking models often rely on complex microphone arrays or multi-camera setups, limiting their applicability in human-robot interaction scenarios. We introduce MuVAP, a causal multimodal framework that extends Voice Activity Projection by grounding acoustic predictions in face tracks, enabling speaker-aware turn-taking predictions from a monaural audio stream and a single camera view. To address the combinatorial complexity of modeling multiple speakers, we propose Role-Relative Projection, which maps any N-speaker interaction onto a fixed current versus next floor-holder state. Because existing audiovisual datasets contain disruptive editing cuts that break causal tracking, we introduce the Audio-Visual Conversation Corpus, a 31-hour dataset of unedited, single-camera multiparty conversations. Evaluations demonstrate that MuVAP outperforms strong baselines on Shift-Hold and next-speaker prediction tasks across two- and three-speaker settings.
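The abstract does not spell out how Role-Relative Projection is computed, so the following is only a minimal sketch of one plausible reading: per-speaker voice activity for any number of speakers is re-indexed into two fixed channels, one for the current floor-holder and one for the next. The function names, the binary frame representation, the tie-breaking rule, and the `horizon` parameter are all illustrative assumptions, not the authors' implementation.

```python
from typing import List, Optional, Tuple

Activity = List[List[int]]  # activity[speaker][frame] is 1 if speaking, else 0


def current_holder(activity: Activity, t: int) -> Optional[int]:
    """Most recent speaker active at or before frame t (assumed definition)."""
    for frame in range(t, -1, -1):
        for speaker, track in enumerate(activity):
            if track[frame]:
                return speaker  # ties go to the lowest index (arbitrary assumption)
    return None


def next_holder(activity: Activity, t: int, horizon: int,
                current: Optional[int]) -> Optional[int]:
    """First speaker other than `current` who becomes active within the horizon."""
    end = min(t + 1 + horizon, len(activity[0]))
    for frame in range(t + 1, end):
        for speaker, track in enumerate(activity):
            if speaker != current and track[frame]:
                return speaker
    return None  # nobody else speaks in the window: a hold


def role_relative(activity: Activity, t: int,
                  horizon: int) -> Tuple[Optional[int], Optional[int], List[List[int]]]:
    """Map N per-speaker tracks onto two role channels: [current, next]."""
    cur = current_holder(activity, t)
    nxt = next_holder(activity, t, horizon, cur)
    window = range(t + 1, min(t + 1 + horizon, len(activity[0])))
    cur_future = [activity[cur][f] if cur is not None else 0 for f in window]
    nxt_future = [activity[nxt][f] if nxt is not None else 0 for f in window]
    return cur, nxt, [cur_future, nxt_future]


# Hypothetical three-speaker example: speaker 0 stops, speaker 1 takes over.
activity = [
    [1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
cur, nxt, target = role_relative(activity, t=3, horizon=4)
print(cur, nxt, target)  # 0 1 [[0, 0, 0, 0], [0, 1, 1, 1]]
```

Under these assumptions, the two-channel target is the same whatever the number or ordering of speakers, which is the property the abstract attributes to a fixed current versus next floor-holder state.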

Keywords

turn-taking prediction, multimodal, voice activity projection, human-robot interaction, multiparty conversation
