Automatic emotion recognition (ER) has recently gained considerable interest due to its potential in many real-world applications. In this context, multimodal approaches have been shown to improve performance over unimodal approaches by combining diverse and complementary sources of information, providing some robustness to noisy and missing modalities. In this paper, we focus on dimensional ER based on the fusion of facial and vocal modalities extracted from videos, where complementary audio-visual (A-V) relationships are explored to predict an individual's emotional states in valence-arousal space. Most state-of-the-art fusion techniques rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of A-V modalities. To address this problem, we introduce a joint cross-attentional model for A-V fusion that extracts the salient features across A-V modalities, allowing it to effectively leverage the inter-modal relationships while retaining the intra-modal relationships. In particular, it computes the cross-attention weights based on the correlation between the joint feature representation and that of the individual modalities. Deploying the joint A-V feature representation in the cross-attention module helps to simultaneously leverage both intra- and inter-modal relationships, thereby significantly improving performance over the vanilla cross-attention module. The effectiveness of our proposed approach is validated experimentally on challenging videos from the RECOLA and AffWild2 datasets. Results indicate that our joint cross-attentional A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches, even when the modalities are noisy or absent.
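To make the fusion mechanism concrete, the following is a minimal PyTorch sketch of joint cross-attention, assuming clip-level feature sequences of equal length for both modalities; the module name, weight names (W_ja, W_ca, etc.), layer sizes, and the ReLU/residual details are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointCrossAttention(nn.Module):
    """Sketch of joint cross-attentional A-V fusion.

    Audio and visual features are assumed to be sequences of shape
    (batch, seq_len, d_a) and (batch, seq_len, d_v); the joint
    representation is their feature-wise concatenation. All weight
    names and shapes are illustrative assumptions.
    """

    def __init__(self, d_a: int, d_v: int):
        super().__init__()
        d = d_a + d_v          # joint feature dimension
        self.scale = d ** 0.5  # scaling for the correlation, as in scaled attention
        # Projections of the joint representation used to correlate it
        # with each individual modality
        self.W_ja = nn.Linear(d, d_a, bias=False)
        self.W_jv = nn.Linear(d, d_v, bias=False)
        # Projections mixing each modality with its correlation map
        self.W_a = nn.Linear(d_a, d_a, bias=False)
        self.W_v = nn.Linear(d_v, d_v, bias=False)
        self.W_ca = nn.Linear(d_a, d_a, bias=False)
        self.W_cv = nn.Linear(d_v, d_v, bias=False)

    def forward(self, x_a: torch.Tensor, x_v: torch.Tensor) -> torch.Tensor:
        # Joint representation: concatenating both modalities lets the
        # attention weights reflect intra- and inter-modal structure.
        j = torch.cat([x_a, x_v], dim=-1)  # (B, L, d_a + d_v)

        # Cross-correlation of each modality with the joint representation
        c_a = torch.tanh(x_a @ self.W_ja(j).transpose(1, 2) / self.scale)  # (B, L, L)
        c_v = torch.tanh(x_v @ self.W_jv(j).transpose(1, 2) / self.scale)  # (B, L, L)

        # Attention maps: each modality re-weighted by its joint correlation
        h_a = F.relu(self.W_a(x_a) + c_a @ self.W_ca(x_a))  # (B, L, d_a)
        h_v = F.relu(self.W_v(x_v) + c_v @ self.W_cv(x_v))  # (B, L, d_v)

        # Residual connection retains the original intra-modal features
        att_a = x_a + h_a
        att_v = x_v + h_v
        return torch.cat([att_a, att_v], dim=-1)  # fused A-V features


# Hypothetical usage: fused features would feed a valence-arousal regressor.
fusion = JointCrossAttention(d_a=128, d_v=512)
x_a = torch.randn(8, 16, 128)  # e.g. 16 audio-feature steps per clip
x_v = torch.randn(8, 16, 512)  # matching visual features
fused = fusion(x_a, x_v)       # shape (8, 16, 640)
```

Because the correlation is taken against the concatenated joint representation rather than against the other modality alone, each modality's attention weights depend on both streams at once, which is what distinguishes this scheme from vanilla cross-attention.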