
HRI

Lightweight Multimodal Emotion Recognition for Companion Robots: A Deep Learning Framework Integrating Facial and Speech Features

Cheng-Kai Lu, Chien-Wei Lu, Guan Bo Lin

Year: 2025
Citations: 1

Abstract

This paper presents a lightweight multimodal deep learning framework for real-time emotion recognition on resource-constrained companion robots, exemplified by Zenbo Junior II. The framework integrates a customized GhostNet with Triplet Attention Modules (TAM) and a Frame Attention Network (FAN) for spatio-temporal facial feature encoding, and employs a depth-optimized one-dimensional convolutional neural network (1D-CNN) for compact speech representation. Decision-level fusion based on the geometric mean enhances robustness to noisy modality predictions. The proposed model comprises 0.92 million parameters and requires 0.77 GFLOPs (billion floating-point operations), achieving 97.56% accuracy on the RAVDESS dataset and 82.33% on CREMA-D. In contrast to existing approaches that optimize accuracy at the expense of computational efficiency, the proposed method balances accuracy, efficiency, and deployability. These results highlight both the novelty and the feasibility of the framework for real-time emotion monitoring in healthcare and human-robot interaction.
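The decision-level fusion step mentioned in the abstract can be illustrated with a minimal sketch. It assumes each modality outputs a per-class probability vector and that the fused score is the renormalized geometric mean of the two. The function names, the `eps` floor, and the three-class example are illustrative assumptions, not details taken from the paper.

```python
import math


def geometric_mean_fusion(face_probs, speech_probs, eps=1e-12):
    """Fuse two per-class probability vectors with a normalized geometric mean.

    Hypothetical helper: the paper names geometric-mean fusion but this exact
    formulation (log-space computation, eps floor, renormalization) is assumed.
    """
    if len(face_probs) != len(speech_probs):
        raise ValueError("both modalities must score the same classes")
    # Per-class geometric mean, computed in log space for numerical stability.
    fused = [
        math.exp(0.5 * (math.log(max(f, eps)) + math.log(max(s, eps))))
        for f, s in zip(face_probs, speech_probs)
    ]
    # Renormalize so the fused scores sum to 1.
    total = sum(fused)
    return [v / total for v in fused]


def predict(face_probs, speech_probs, labels):
    """Return the label with the highest fused score, plus the fused scores."""
    fused = geometric_mean_fusion(face_probs, speech_probs)
    best = max(range(len(fused)), key=fused.__getitem__)
    return labels[best], fused


# Illustrative values only: the two modalities disagree on the top class.
labels = ["happy", "sad", "angry"]
face = [0.7, 0.2, 0.1]
speech = [0.4, 0.5, 0.1]
label, fused = predict(face, speech, labels)
print(label, [round(p, 3) for p in fused])
```

Because the geometric mean multiplies the scores, a class needs support from both modalities to rank highly, which is one plausible reading of why this fusion rule tolerates a single noisy modality.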

Keywords

Deep learning, Robustness (evolution), Convolutional neural network, Emotion recognition, Facial expression, Novelty, Feature extraction, Feature (linguistics)
