Multimodal emotion recognition has recently attracted much attention because it can leverage the diverse and complementary relationships among multiple modalities (e.g., audio, visual, biosignals) and can provide some robustness to noisy modalities. Most state-of-the-art methods for audio-visual (A-V) fusion rely on recurrent networks or conventional attention mechanisms that do not effectively exploit the complementary nature of the A-V modalities. In this paper, we focus on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos. Specifically, we propose a joint cross-attention model that relies on the complementary relationships across A-V modalities to extract salient features, allowing accurate prediction of continuous values of valence and arousal. The proposed fusion model efficiently leverages inter-modal relationships while reducing the heterogeneity between the modality features. In particular, it computes the cross-attention weights from the correlation between the combined A-V feature representation and the features of the individual modalities. By feeding the combined A-V feature representation into the cross-attention module, our fusion module improves significantly over a vanilla cross-attention module. Experimental results on validation-set videos from the AffWild2 dataset indicate that the proposed A-V fusion model is a cost-effective solution that can outperform state-of-the-art approaches. The code is available on GitHub: https://github.com/praveena2j/JointCrossAttentional-AV-Fusion.
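To make the fusion mechanism concrete, the following is a minimal PyTorch sketch of joint cross-attention between audio and visual features. It is an illustrative simplification of the idea described above, not the repository's implementation: the class name, layer names (`w_ja`, `w_ca`, ...), feature dimensions, and the residual formulation are assumptions made for this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointCrossAttentionFusion(nn.Module):
    """Sketch of joint cross-attentional A-V fusion.

    Audio features x_a (B, L, d_a) and visual features x_v (B, L, d_v)
    over L clip segments are concatenated into a joint representation
    j (B, L, d_a + d_v). Each modality's cross-attention weights are
    computed from the correlation between j and that modality, and the
    attended features are concatenated for downstream valence/arousal
    regression.
    """

    def __init__(self, d_a: int, d_v: int, d_att: int = 128):
        super().__init__()
        d_joint = d_a + d_v
        self.scale = d_joint ** 0.5
        # Projections used to correlate the joint representation with each modality.
        self.w_ja = nn.Linear(d_joint, d_a, bias=False)
        self.w_jv = nn.Linear(d_joint, d_v, bias=False)
        # Projections that turn features and correlations into attention maps.
        self.w_a = nn.Linear(d_a, d_att, bias=False)
        self.w_ca = nn.Linear(d_a, d_att, bias=False)
        self.w_v = nn.Linear(d_v, d_att, bias=False)
        self.w_cv = nn.Linear(d_v, d_att, bias=False)
        # Map attention maps back to each modality's feature dimension.
        self.w_ha = nn.Linear(d_att, d_a, bias=False)
        self.w_hv = nn.Linear(d_att, d_v, bias=False)

    def forward(self, x_a: torch.Tensor, x_v: torch.Tensor) -> torch.Tensor:
        j = torch.cat([x_a, x_v], dim=-1)  # (B, L, d_a + d_v)

        # Joint cross-correlation matrices: (B, L, L); row i correlates joint
        # segment i with every segment of the given modality.
        c_a = torch.tanh(torch.bmm(self.w_ja(j), x_a.transpose(1, 2)) / self.scale)
        c_v = torch.tanh(torch.bmm(self.w_jv(j), x_v.transpose(1, 2)) / self.scale)

        # Attention maps combine raw features with correlation-weighted features.
        h_a = F.relu(self.w_a(x_a) + torch.bmm(c_a, self.w_ca(x_a)))  # (B, L, d_att)
        h_v = F.relu(self.w_v(x_v) + torch.bmm(c_v, self.w_cv(x_v)))  # (B, L, d_att)

        # Attended features with a residual connection, then concatenation.
        x_a_att = x_a + self.w_ha(h_a)  # (B, L, d_a)
        x_v_att = x_v + self.w_hv(h_v)  # (B, L, d_v)
        return torch.cat([x_a_att, x_v_att], dim=-1)  # (B, L, d_a + d_v)


# Hypothetical usage: 4 videos, 8 segments each, 128-d audio and 512-d visual features.
fusion = JointCrossAttentionFusion(d_a=128, d_v=512)
fused = fusion(torch.randn(4, 8, 128), torch.randn(4, 8, 512))  # (4, 8, 640)
```

The design choice this sketch tries to capture is the one stated in the abstract: correlations are computed against the joint (concatenated) A-V representation rather than against the other modality alone, which is what distinguishes the joint cross-attention from vanilla cross-attention.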