HRI

Enhancing Face-to-Emotion Recognition with Vision Transformer and Human-in-the-Loop Approach

Mahedi Hasan, Naimul Islam Shuvo, Airin Akter, Fahim Shakil Tamim, Nazir Ahmed, Naeem Mia, Shahinur Alam

Year of publication
2025
Citations
1

Abstract

Recognizing emotions from facial expressions is essential for applications such as human-computer interaction, mental health, and social robotics. Deep learning approaches have achieved promising performance, but their generalization across diverse datasets and real-world conditions remains limited. In this paper, we introduce a Vision Transformer (ViT) integrated with a Human-in-the-Loop (HITL) framework that improves emotion detection accuracy, robustness, and cross-dataset generalizability. The proposed framework incorporates human expertise during the learning and evaluation phases of the model, which allows targeted correction of the model's output. It also helps identify mislabeled instances and thereby improves the decision boundaries with little human effort. We conducted experiments on four benchmark datasets: FER2013, RAF-DB, AffectNet-7, and ExpW. Comparative analysis shows that incorporating HITL with the ViT significantly improves classification accuracy across all datasets, particularly in challenging cross-domain settings. To make this process more efficient, we introduce a confidence-based intervention method in which only ambiguous predictions are reviewed, reducing the manual effort required. We also implement incremental model updates that allow the system to improve continuously without retraining from scratch; the weights of the trained model are updated through backpropagation. This combination of human feedback and the ViT makes the model more reliable and adaptable for real-world use. The ViT integrated with human feedback achieves 7%, 5%, 10%, and 13% higher accuracy than the baseline ViT model on the FER2013, RAF-DB, AffectNet-7, and ExpW datasets, respectively, and outperforms existing methods. However, the proposed model requires additional time to correct mispredictions.
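
The confidence-based intervention described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `select_for_review` and `merge_human_labels`, the 0.6 threshold, and the dictionary format for human corrections are all assumptions made for illustration. The idea shown is only that predictions whose top-class probability falls below a threshold are sent to a human reviewer, while confident predictions are kept as they are.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw class scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_for_review(batch_logits, threshold=0.6):
    """Return indices of samples whose top-class probability is below threshold.

    The threshold value is an assumption; the abstract does not state one.
    """
    return [i for i, logits in enumerate(batch_logits)
            if max(softmax(logits)) < threshold]

def merge_human_labels(batch_logits, human_labels, threshold=0.6):
    """Keep the model's argmax for confident samples; use the human label otherwise.

    human_labels maps sample index -> corrected class index (hypothetical format).
    Flagged samples without a human label fall back to the model's prediction.
    """
    flagged = set(select_for_review(batch_logits, threshold))
    merged = []
    for i, logits in enumerate(batch_logits):
        predicted = max(range(len(logits)), key=lambda c: logits[c])
        merged.append(human_labels.get(i, predicted) if i in flagged else predicted)
    return merged
```

In a full pipeline, the corrected labels would presumably feed the incremental updates mentioned in the abstract, for example as a few fine-tuning steps on the reviewed samples; that step is omitted here because the abstract gives no details about it.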

Keywords

Benchmark (surveying), Retraining, Process (computing), Generalization, Transformer, Baseline (sea), Facial expression
