This paper presents our system for the Multi-Task Learning (MTL) Challenge of the 4th Affective Behavior Analysis in-the-wild (ABAW) competition. We address the research problems of this challenge from three aspects: 1) To obtain efficient and robust visual feature representations, we propose MAE-based unsupervised representation learning and IResNet/DenseNet-based supervised representation learning methods; 2) Given the importance of temporal information in videos, we explore three types of sequential encoders to capture it: a transformer-based encoder, an LSTM-based encoder, and a GRU-based encoder; 3) To model the correlations among the different tasks (i.e., valence, arousal, expression, and AU) in multi-task affective analysis, we first analyze the dependencies among these tasks and then propose three multi-task learning frameworks that model the correlations effectively. Our system achieves a score of $1.7607$ on the validation set and $1.4361$ on the test set, ranking first in the MTL Challenge. The code is available at https://github.com/AIM3-RUC/ABAW4.
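To make the overall pipeline concrete, the following is a minimal sketch (not the authors' code) of the second and third components: a GRU-based sequential encoder over pre-extracted per-frame visual features, feeding one prediction head per task. The feature dimension, hidden size, class counts, and all names are illustrative assumptions.

```python
# Minimal sketch, assuming 512-d per-frame visual features (e.g., from an
# MAE or IResNet/DenseNet backbone). Not the authors' implementation.
import torch
import torch.nn as nn

class MultiTaskTemporalModel(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, n_expr=8, n_au=12):
        super().__init__()
        # Temporal encoder: a bidirectional GRU over the frame sequence
        # (the paper also explores transformer- and LSTM-based encoders).
        self.encoder = nn.GRU(feat_dim, hidden,
                              batch_first=True, bidirectional=True)
        d = 2 * hidden
        self.va_head = nn.Linear(d, 2)        # valence/arousal regression
        self.expr_head = nn.Linear(d, n_expr) # expression classification
        self.au_head = nn.Linear(d, n_au)     # per-AU binary detection

    def forward(self, feats):                 # feats: (B, T, feat_dim)
        h, _ = self.encoder(feats)            # h: (B, T, 2 * hidden)
        return {
            "va": torch.tanh(self.va_head(h)),  # bounded to [-1, 1]
            "expr": self.expr_head(h),          # per-frame logits
            "au": self.au_head(h),              # per-frame AU logits
        }

# Usage: two clips of 32 frames each, random features as placeholders.
model = MultiTaskTemporalModel()
out = model(torch.randn(2, 32, 512))
print({k: v.shape for k, v in out.items()})
```

Sharing one temporal encoder across all four task heads is the simplest way to exploit inter-task correlations; the paper's three proposed frameworks refine how the tasks interact beyond this shared-trunk baseline.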