This paper presents our submission to the Expression Classification Challenge of the fifth Affective Behavior Analysis in-the-wild (ABAW) Competition. In our method, we combine multimodal features extracted by several different pre-trained models to capture more effective emotional information. For these combined visual and audio features, we utilize two temporal encoders to explore the temporal contextual information in the data. In addition, we employ several ensemble strategies under different experimental settings to obtain the most accurate expression recognition results. Our system achieves an average F1 score of 0.45774 on the validation set.
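The abstract outlines a pipeline of per-frame multimodal feature fusion followed by a temporal encoder and a per-frame expression classifier. The following is a minimal sketch of that idea, assuming concatenation-based fusion, a Transformer encoder as one of the temporal encoders, illustrative feature dimensions, and an 8-way expression output; none of these specifics are taken from the paper itself.

```python
import torch
import torch.nn as nn


class TemporalExpressionClassifier(nn.Module):
    """Sketch: fuse visual/audio frame features, model temporal context,
    and predict per-frame expression logits (all sizes are assumptions)."""

    def __init__(self, vis_dim=512, aud_dim=256, d_model=256,
                 num_layers=4, num_classes=8):
        super().__init__()
        # Early fusion: project concatenated modality features to the encoder width.
        self.proj = nn.Linear(vis_dim + aud_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, vis_feats, aud_feats):
        # vis_feats: (batch, seq_len, vis_dim); aud_feats: (batch, seq_len, aud_dim)
        x = torch.cat([vis_feats, aud_feats], dim=-1)  # concatenate modalities per frame
        x = self.proj(x)                               # project to model dimension
        x = self.encoder(x)                            # temporal context across the clip
        return self.head(x)                            # per-frame expression logits


if __name__ == "__main__":
    model = TemporalExpressionClassifier()
    vis = torch.randn(2, 30, 512)   # e.g. 30-frame clips of visual features
    aud = torch.randn(2, 30, 256)   # temporally aligned audio features
    print(model(vis, aud).shape)    # torch.Size([2, 30, 8])
```

In practice, the outputs of several such models (e.g. with different feature combinations or temporal encoders) would be ensembled, for instance by averaging logits, before computing the F1 score.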