Human emotion recognition plays an important role in human-computer interaction. In this paper, we present our approach to the Valence-Arousal (VA) Estimation Challenge, Expression (Expr) Classification Challenge, and Action Unit (AU) Detection Challenge of the 5th Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW). Specifically, we propose a novel multi-modal fusion model that leverages Temporal Convolutional Networks (TCN) and a Transformer to enhance the performance of continuous emotion recognition. The model effectively integrates visual and audio information to improve emotion-recognition accuracy. It outperforms the baseline and ranks 3rd in the Expression Classification Challenge.
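As a minimal sketch of the architecture the abstract describes, the snippet below applies a small TCN to each modality's per-frame features and fuses the two streams with a Transformer encoder. The paper does not specify layer counts, feature dimensions, or head sizes, so every name and size here (`vis_dim=512`, `aud_dim=128`, two TCN blocks, etc.) is an illustrative assumption, not the authors' configuration.

```python
# Illustrative TCN + Transformer multi-modal fusion sketch (PyTorch).
# All dimensions and depths are assumptions; the paper does not give them.
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """One dilated 1D-conv residual block, the building unit of a TCN."""
    def __init__(self, channels, dilation):
        super().__init__()
        # kernel_size=3 with padding=dilation keeps the sequence length unchanged
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.relu = nn.ReLU()

    def forward(self, x):                      # x: (batch, channels, time)
        return self.relu(self.conv(x)) + x     # residual connection

class FusionModel(nn.Module):
    def __init__(self, vis_dim=512, aud_dim=128, d_model=256, n_classes=8):
        super().__init__()
        # project each modality to a common channel width
        self.vis_proj = nn.Conv1d(vis_dim, d_model, kernel_size=1)
        self.aud_proj = nn.Conv1d(aud_dim, d_model, kernel_size=1)
        # per-modality temporal modeling with increasing dilation
        self.vis_tcn = nn.Sequential(TemporalBlock(d_model, 1),
                                     TemporalBlock(d_model, 2))
        self.aud_tcn = nn.Sequential(TemporalBlock(d_model, 1),
                                     TemporalBlock(d_model, 2))
        # Transformer encoder over the concatenated audio-visual features
        enc_layer = nn.TransformerEncoderLayer(d_model=2 * d_model, nhead=8,
                                               batch_first=True)
        self.fusion = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head = nn.Linear(2 * d_model, n_classes)  # per-frame logits

    def forward(self, vis, aud):               # each: (batch, time, feat_dim)
        v = self.vis_tcn(self.vis_proj(vis.transpose(1, 2)))
        a = self.aud_tcn(self.aud_proj(aud.transpose(1, 2)))
        fused = torch.cat([v, a], dim=1).transpose(1, 2)  # (batch, time, 2*d_model)
        return self.head(self.fusion(fused))              # (batch, time, n_classes)
```

For the Expression task the head would emit one logit vector per frame; for VA estimation the same trunk could end in a 2-unit regression head instead.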