HRI

SalsaAgent: A multimodal embodied language model for interactive dance generation

Payam Jome Yazdian, Zoe Stanley, Angelica Lim

Year: 2026
Access: Open access

Abstract

Interaction between humanoids involves bidirectional and nonverbal reactivity, coordination and synchrony. Toward socially aware robots and interactive virtual agents, we present SalsaAgent, a language model that generates expressive, full-body salsa dance motions in reaction to a human leader and against a contextual music backdrop. We formulate interaction as nonverbal motion token passing, extending the vocabulary of a large language model (LLM) to process discrete motion tokens, pairwise relation tokens, and audio. Our contributions include new tokens for full-body and motion relations, LLM fine-tuning using automatically derived text descriptions of skeleton dynamics for token grounding, and a two-stage token-to-diffusion pipeline. Subjective and objective evaluations demonstrate the effectiveness of our approach in terms of motion quality, music and partner coordination, and consistent two-person spatial behavior, with significant improvements over baselines.

Keywords

dance generation, multimodal, language model, human-robot interaction, motion token
