
Robotic Emotion Recognition Using Two-Level Features Fusion in Audio Signals of Speech

Chang Li

Publication year
2021
Citations
8

Abstract

Speech emotion recognition (SER) is a challenging task, since the definition of emotions in sentences is ambiguous. Previous research has mainly focused on extracting hand-crafted features from audio signals and feeding them into shallow models. Recently, the Visual Geometry Group-like network (VGGish) has replaced traditional feature extractors because of its effectiveness. VGGish feature vectors are viewed as features that a deep neural network (DNN) has selected from a large number of candidate features. Although existing studies on SER have achieved promising results, they use only single-level features. To address this limitation, this paper proposes an emotion recognition system based on speech signals that uses two-level features with position information: Later Feature Fusion with VGGish Overlap (LFFVO). First, the position information in the two-level features is extracted by a bidirectional long short-term memory (BiLSTM) neural network; the features are then fused to predict the emotion. The proposed method improved accuracy from 48.2% (baseline) to 69.5% when trained, validated, and evaluated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database.
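The abstract does not spell out the architecture, so the following is only a minimal sketch of the general pattern it names: cut a signal into overlapping windows, summarize two feature streams separately, and fuse them late by concatenation. The function names, the window and hop sizes, the toy "two levels", and the use of mean pooling in place of the BiLSTM encoder are all assumptions for illustration, not the paper's implementation. Mean pooling in particular discards the position information that the BiLSTM is meant to capture.

```python
from typing import List, Sequence


def overlapping_windows(samples: Sequence[float], window: int, hop: int) -> List[List[float]]:
    """Split samples into fixed-length windows; hop < window makes them overlap."""
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    last_start = len(samples) - window
    return [list(samples[i:i + window]) for i in range(0, last_start + 1, hop)]


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average equal-length vectors (a simple stand-in for a BiLSTM encoder)."""
    if not frames:
        raise ValueError("need at least one frame")
    return [sum(column) / len(frames) for column in zip(*frames)]


def late_fusion(level_a: Sequence[float], level_b: Sequence[float]) -> List[float]:
    """Fuse two utterance-level vectors by concatenation."""
    return list(level_a) + list(level_b)


# Toy signal: 10 samples, windows of 4 with a hop of 2 (50% overlap).
signal = [float(i) for i in range(10)]
frames = overlapping_windows(signal, window=4, hop=2)

# Hypothetical "two levels": the raw windows, and a per-window [min, max] summary.
level_a = mean_pool(frames)
level_b = mean_pool([[min(f), max(f)] for f in frames])

# The fused vector would be the input to a classifier over emotion labels.
fused = late_fusion(level_a, level_b)
```

In a real system each level would be encoded by a sequence model before fusion, and the fused vector would feed a classifier; the sketch only shows how overlap and late fusion fit together.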

Keywords

Artificial intelligence, Computer science, Artificial neural network, Speech recognition, Feature (linguistics), Pattern recognition (psychology)
