We used two multimodal models for continuous valence-arousal recognition from visual, audio, and linguistic information. The first model is the same as the one we used in ABAW2 and ABAW3; it employs leader-follower attention. The second model shares the same architecture for spatial and temporal encoding, but its fusion block uses a compact and straightforward channel attention borrowed from the End2You toolkit. Unlike our previous attempts, which used VGGish features directly as the audio input, this time we feed log-Mel spectrograms into the pre-trained VGG model and fine-tune it during training. To make full use of the data and alleviate over-fitting, we carry out cross-validation and select the fold with the highest concordance correlation coefficient (CCC) for submission. The code will be made available at https://github.com/sucv/ABAW5.
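To illustrate the kind of fusion described, the sketch below shows a compact channel-attention block in PyTorch that rescales concatenated per-frame visual, audio, and linguistic features before regressing valence and arousal. This is a minimal, hypothetical example, not the End2You implementation or the released code; the class name, feature dimensions, and bottleneck ratio are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    """Hypothetical channel-attention fusion over concatenated modality features.

    Per-frame visual, audio, and linguistic features are concatenated along the
    channel dimension; a small bottleneck MLP produces per-channel weights that
    rescale the concatenation before a linear head predicts valence and arousal.
    """

    def __init__(self, feature_dims, reduction=4):
        super().__init__()
        total_dim = sum(feature_dims)
        self.gate = nn.Sequential(
            nn.Linear(total_dim, total_dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(total_dim // reduction, total_dim),
            nn.Sigmoid(),
        )
        self.head = nn.Linear(total_dim, 2)  # valence and arousal

    def forward(self, visual, audio, text):
        # Each input: (batch, time, dim_m); fusion is applied per time step.
        fused = torch.cat([visual, audio, text], dim=-1)
        weights = self.gate(fused)          # per-channel attention weights
        return self.head(fused * weights)   # (batch, time, 2)

# Example with assumed feature dimensions:
# fusion = ChannelAttentionFusion([512, 128, 768])
# out = fusion(torch.randn(2, 300, 512), torch.randn(2, 300, 128), torch.randn(2, 300, 768))
```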
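The fold-selection criterion is the concordance correlation coefficient. A minimal sketch of how CCC can be computed over per-frame predictions and labels is shown below; it assumes NumPy arrays and is not necessarily identical to the challenge evaluation code.

```python
import numpy as np

def ccc(preds: np.ndarray, labels: np.ndarray) -> float:
    """Concordance correlation coefficient between two 1-D sequences."""
    mean_p, mean_l = preds.mean(), labels.mean()
    var_p, var_l = preds.var(), labels.var()
    cov = np.mean((preds - mean_p) * (labels - mean_l))
    return 2.0 * cov / (var_p + var_l + (mean_p - mean_l) ** 2)
```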