We used two multimodal models for continuous valence-arousal recognition from visual, audio, and linguistic information. The first model is the same as the one we used in ABAW2 and ABAW3, which employs leader-follower attention. The second model shares the same architecture for spatial and temporal encoding, but its fusion block employs a compact and straightforward channel attention borrowed from the End2You toolkit. Unlike our previous attempts, which used VGGish features directly as the audio representation, this time we feed log-Mel spectrograms into a pre-trained VGG model and fine-tune it during training. To make full use of the data and alleviate over-fitting, we carry out cross-validation. The code is available at https://github.com/sucv/ABAW3.
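Below is a minimal PyTorch sketch of the two ingredients described above: a log-Mel front end feeding a pre-trained VGG backbone that is fine-tuned end-to-end, and a compact squeeze-and-excitation style channel attention used to fuse per-modality features. It is an assumed illustration, not the authors' exact code; the class names (AudioVGG, ChannelAttentionFusion) and all hyper-parameters (16 kHz audio, 64 Mel bins, 512-d modality features, reduction ratio 16) are illustrative, and the actual End2You fusion block may differ in detail.

import torch
import torch.nn as nn
import torchaudio
import torchvision


class AudioVGG(nn.Module):
    """Log-Mel spectrogram -> pre-trained VGG backbone, fine-tuned jointly."""

    def __init__(self, sample_rate=16000, n_mels=64, out_dim=512):
        super().__init__()
        # Waveform -> log-Mel spectrogram front end.
        self.logmel = nn.Sequential(
            torchaudio.transforms.MelSpectrogram(
                sample_rate=sample_rate, n_fft=1024, hop_length=256, n_mels=n_mels),
            torchaudio.transforms.AmplitudeToDB(),
        )
        vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1")
        self.backbone = vgg.features            # kept trainable, i.e. fine-tuned
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Linear(512, out_dim)

    def forward(self, wav):                     # wav: (batch, num_samples)
        x = self.logmel(wav).unsqueeze(1)       # (batch, 1, n_mels, frames)
        x = x.repeat(1, 3, 1, 1)                # tile to 3 channels for the VGG stem
        x = self.pool(self.backbone(x)).flatten(1)
        return self.proj(x)                     # (batch, out_dim)


class ChannelAttentionFusion(nn.Module):
    """Compact channel attention over concatenated modality features."""

    def __init__(self, dims=(512, 512, 512), reduction=16):
        super().__init__()
        total = sum(dims)
        self.gate = nn.Sequential(
            nn.Linear(total, total // reduction), nn.ReLU(inplace=True),
            nn.Linear(total // reduction, total), nn.Sigmoid(),
        )

    def forward(self, feats):                   # list of (batch, T, D_m) tensors
        x = torch.cat(feats, dim=-1)            # (batch, T, sum(dims))
        w = self.gate(x.mean(dim=1))            # squeeze over time -> channel gates
        return x * w.unsqueeze(1)               # re-weight concatenated channels


# Usage: encode one second of audio, then fuse visual, audio, and linguistic features.
audio_encoder = AudioVGG()
audio_feat = audio_encoder(torch.randn(2, 16000))     # (2, 512)

fusion = ChannelAttentionFusion()
visual = torch.randn(2, 100, 512)                     # (batch, frames, feat)
audio = torch.randn(2, 100, 512)
text = torch.randn(2, 100, 512)
fused = fusion([visual, audio, text])                 # (2, 100, 1536)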