This paper presents our submission to the Multi-Task Learning (MTL) Challenge of the 4th Affective Behavior Analysis in-the-wild (ABAW) competition. Based on visual feature representations, we utilize three types of temporal encoders to capture temporal context information in the video: a transformer-based encoder, an LSTM-based encoder, and a GRU-based encoder. With the temporal context-aware representations, we employ a multi-task framework to predict the valence, arousal, expression, and action unit (AU) values of the images. In addition, smoothing is applied to refine the initial valence and arousal predictions, and a model ensemble strategy is used to combine results from different model setups. Our system achieves a performance score of $1.742$ on the MTL Challenge validation dataset.
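To make the pipeline concrete, the following is a minimal sketch of one multi-task setup, assuming a bidirectional GRU temporal encoder over pre-extracted per-frame visual features. All layer sizes, head dimensions, and names (e.g., `MultiTaskTemporalModel`) are illustrative assumptions rather than the exact configuration of our system; the transformer- and LSTM-based variants fill the same encoder role.

```python
# Minimal sketch: shared temporal encoder + task-specific heads.
# Dimensions and names are illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn

class MultiTaskTemporalModel(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=256, num_expr=8, num_aus=12):
        super().__init__()
        # Temporal encoder: a GRU here; transformer/LSTM variants are drop-in
        # replacements in the same role.
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        enc_dim = hidden_dim * 2
        # Task-specific heads on the shared temporal representation.
        self.va_head = nn.Linear(enc_dim, 2)           # valence, arousal
        self.expr_head = nn.Linear(enc_dim, num_expr)  # expression logits
        self.au_head = nn.Linear(enc_dim, num_aus)     # per-AU logits

    def forward(self, feats):
        # feats: (batch, seq_len, feat_dim) per-frame visual features.
        ctx, _ = self.encoder(feats)        # temporal context per frame
        va = torch.tanh(self.va_head(ctx))  # bound VA predictions to [-1, 1]
        expr = self.expr_head(ctx)
        au = self.au_head(ctx)
        return va, expr, au

# Usage on a dummy clip of 64 frames:
model = MultiTaskTemporalModel()
va, expr, au = model(torch.randn(2, 64, 512))
```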
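The post-processing can be sketched in the same spirit; here we assume a simple moving-average smoother for the frame-level valence/arousal curves and plain averaging as the ensemble rule, since the abstract does not specify either detail.

```python
# Sketch of smoothing + ensembling; window size and averaging rule are assumptions.
import numpy as np

def smooth(preds, window=5):
    """Moving-average smoothing of a (num_frames,) prediction sequence."""
    kernel = np.ones(window) / window
    # mode="same" keeps the output aligned with the input frames.
    return np.convolve(preds, kernel, mode="same")

def ensemble(pred_list):
    """Average frame-level predictions from different model setups."""
    return np.mean(np.stack(pred_list, axis=0), axis=0)

# Combine three runs, then smooth the fused valence curve:
valence_runs = [np.random.uniform(-1, 1, 100) for _ in range(3)]
valence_final = smooth(ensemble(valence_runs))
```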