HRI

Personalized Speech Emotion Recognition in Human-Robot Interaction Using Vision Transformers

Ruchik Mishra, Andrew Frye, Madan M. Rayguru, Dan O. Popa

Publication year
2025
Citations
4

Abstract

Emotions are an essential element of human verbal communication, so it is important to understand individuals' affect during human-robot interaction (HRI). This letter investigates the application of vision transformer models, namely ViT (Vision Transformers) and BEiT (Bidirectional Encoder Representations from Pre-Training of Image Transformers) pipelines, for Speech Emotion Recognition (SER) in HRI. The focus is to generalize the SER models for individual speech characteristics by fine-tuning these models on benchmark datasets and exploiting ensemble methods. For this purpose, we collected audio data from several human subjects having pseudo-naturalistic conversations with the NAO social robot. We then fine-tuned our ViT and BEiT-based models and tested them on unseen speech samples from the participants in order to identify four primary emotions from speech: neutral, happy, sad, and angry. The results show that fine-tuning vision transformers on benchmark datasets and then either using these already fine-tuned models or ensembling ViT/BEiT models yields higher classification accuracies than fine-tuning vanilla ViTs or BEiTs.

Keywords

Human–robot interaction, Transformer, Computer science, Human–computer interaction, Speech recognition, Emotion recognition, Robot, Artificial intelligence, Computer vision, Psychology
