In this paper, we present our solutions for the 5th Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW), which comprises four sub-challenges: Valence-Arousal (VA) Estimation, Expression (Expr) Classification, Action Unit (AU) Detection, and Emotional Reaction Intensity (ERI) Estimation. The 5th ABAW competition focuses on facial affect recognition using different modalities and datasets. In our work, we extract powerful audio and visual features with a number of state-of-the-art models, and fuse these features with a Transformer Encoder and TEMMA. In addition, to avoid the possible impact of large dimensional differences between the various features, we design an Affine Module that aligns different features to the same dimension. Extensive experiments demonstrate the superiority of the proposed method. For the VA Estimation sub-challenge, our method obtains a mean Concordance Correlation Coefficient (CCC) of 0.6066. For the Expression Classification sub-challenge, the average F1 score is 0.4055. For the AU Detection sub-challenge, the average F1 score is 0.5296. For the ERI Estimation sub-challenge, the average Pearson's correlation coefficient on the validation set is 0.3968. The results on all four sub-challenges outperform the baselines by a large margin.
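The Affine Module mentioned above can be understood as a learned affine transform y = Wx + b that projects each feature stream, whatever its native dimensionality, into a shared space before fusion. The following is a minimal pure-Python sketch of this idea; the class name, toy weights, and dimensions are illustrative assumptions, not the authors' exact implementation (in practice W and b would be learned parameters).

```python
# Hypothetical sketch of an affine alignment module: maps a feature
# vector of dimension in_dim to a common dimension out_dim via
# y_i = sum_j W[i][j] * x[j] + b[i].
class AffineModule:
    def __init__(self, in_dim, out_dim):
        # Fixed toy weights for illustration; in a real model these
        # would be learned during training.
        self.W = [[0.1] * in_dim for _ in range(out_dim)]
        self.b = [0.0] * out_dim

    def __call__(self, x):
        # Apply the affine transform row by row.
        return [sum(w * v for w, v in zip(row, x)) + bi
                for row, bi in zip(self.W, self.b)]

# Align an audio feature (dim 4) and a visual feature (dim 6) to a
# shared dimension of 3, so both can be fed to the same fusion model.
audio_proj = AffineModule(4, 3)
visual_proj = AffineModule(6, 3)
aligned_audio = audio_proj([1.0, 2.0, 3.0, 4.0])
aligned_visual = visual_proj([1.0] * 6)
assert len(aligned_audio) == len(aligned_visual) == 3
```

After this alignment step, the per-modality features share a common dimensionality, so they can be concatenated or stacked as tokens for the Transformer-based fusion.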