Multimodal emotion recognition has recently attracted much attention because it can leverage the diverse and complementary relationships among multiple modalities (e.g., audio, visual, biosignals) and can provide some robustness to noisy modalities. Most state-of-the-art methods for audio-visual (A-V) fusion rely on recurrent networks or conventional attention mechanisms that do not effectively exploit the complementary nature of the A-V modalities. In this paper, we focus on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos. Specifically, we propose a joint cross-attention model that relies on the complementary relationships across A-V modalities to extract salient features, allowing accurate prediction of continuous values of valence and arousal. The proposed fusion model efficiently leverages inter-modal relationships while reducing the heterogeneity between the modality features. In particular, it computes the cross-attention weights from the correlation between the combined A-V feature representation and the features of the individual modalities. By feeding the combined A-V feature representation into the cross-attention module, our fusion module improves significantly over a vanilla cross-attention module. Experimental results on validation-set videos from the AffWild2 dataset indicate that the proposed A-V fusion model is a cost-effective solution that can outperform state-of-the-art approaches. The code is available on GitHub: https://github.com/praveena2j/JointCrossAttentional-AV-Fusion.
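To make the fusion mechanism concrete, the following is a minimal PyTorch sketch of joint cross-attention between audio and visual features. It is an illustrative simplification of the idea described above, not the repository's implementation: the class name, layer names (`w_ja`, `w_ca`, ...), feature dimensions, and the residual formulation are assumptions made for this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointCrossAttentionFusion(nn.Module):
    """Sketch of joint cross-attentional A-V fusion.

    Audio features x_a (B, L, d_a) and visual features x_v (B, L, d_v)
    over L clip segments are concatenated into a joint representation
    j (B, L, d_a + d_v). Each modality's cross-attention weights are
    computed from the correlation between j and that modality, and the
    attended features are concatenated for downstream valence/arousal
    regression.
    """

    def __init__(self, d_a: int, d_v: int, d_att: int = 128):
        super().__init__()
        d_joint = d_a + d_v
        self.scale = d_joint ** 0.5
        # Projections used to correlate the joint representation with each modality.
        self.w_ja = nn.Linear(d_joint, d_a, bias=False)
        self.w_jv = nn.Linear(d_joint, d_v, bias=False)
        # Projections that turn features and correlations into attention maps.
        self.w_a = nn.Linear(d_a, d_att, bias=False)
        self.w_ca = nn.Linear(d_a, d_att, bias=False)
        self.w_v = nn.Linear(d_v, d_att, bias=False)
        self.w_cv = nn.Linear(d_v, d_att, bias=False)
        # Map attention maps back to each modality's feature dimension.
        self.w_ha = nn.Linear(d_att, d_a, bias=False)
        self.w_hv = nn.Linear(d_att, d_v, bias=False)

    def forward(self, x_a: torch.Tensor, x_v: torch.Tensor) -> torch.Tensor:
        j = torch.cat([x_a, x_v], dim=-1)  # (B, L, d_a + d_v)

        # Joint cross-correlation matrices: (B, L, L); row i correlates joint
        # segment i with every segment of the given modality.
        c_a = torch.tanh(torch.bmm(self.w_ja(j), x_a.transpose(1, 2)) / self.scale)
        c_v = torch.tanh(torch.bmm(self.w_jv(j), x_v.transpose(1, 2)) / self.scale)

        # Attention maps combine raw features with correlation-weighted features.
        h_a = F.relu(self.w_a(x_a) + torch.bmm(c_a, self.w_ca(x_a)))  # (B, L, d_att)
        h_v = F.relu(self.w_v(x_v) + torch.bmm(c_v, self.w_cv(x_v)))  # (B, L, d_att)

        # Attended features with a residual connection, then concatenation.
        x_a_att = x_a + self.w_ha(h_a)  # (B, L, d_a)
        x_v_att = x_v + self.w_hv(h_v)  # (B, L, d_v)
        return torch.cat([x_a_att, x_v_att], dim=-1)  # (B, L, d_a + d_v)


# Hypothetical usage: 4 videos, 8 segments each, 128-d audio and 512-d visual features.
fusion = JointCrossAttentionFusion(d_a=128, d_v=512)
fused = fusion(torch.randn(4, 8, 128), torch.randn(4, 8, 512))  # (4, 8, 640)
```

The design choice this sketch tries to capture is the one stated in the abstract: correlations are computed against the joint (concatenated) A-V representation rather than against the other modality alone, which is what distinguishes the joint cross-attention from vanilla cross-attention.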