To enable artificial intelligence to provide empathetic services, multimodal Emotion Recognition in Conversation (ERC) plays an influential role in human-computer interaction and conversational robotics. Multimodal data modeling, inspired by human multi-sensory integration capabilities, has become a promising research area in recent years. To date, there have been few studies on multimodal conversational emotion recognition. Most existing multimodal ERC methods do not model cross-modal interactions and are therefore unable to extract inter-modal complementary information. Several graph-based approaches claim to capture inter-modal complementary information, but the heterogeneity of multimodal data makes it difficult for graph-based models to reach an optimal solution. In this work, we introduce a Graph and Attention-based Two-stage Multi-source Information Fusion (GA2MIF) approach for multimodal fusion. GA2MIF performs contextual modeling and cross-modal modeling with Multi-head Directed Graph ATtention networks (MDGATs) and Multi-head Pairwise Cross-modal ATtention networks (MPCATs), respectively. Extensive experiments on two common datasets show that the proposed GA2MIF effectively captures intra-modal local and long-range contextual information as well as inter-modal complementary information, and outperforms existing State-Of-The-Art (SOTA) baselines by an absolute margin.
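To make the two-stage idea concrete, the sketch below illustrates one plausible reading of the pipeline in PyTorch: a first stage of per-modality attention over the utterance sequence (standing in for MDGATs' contextual modeling) followed by a second stage of pairwise cross-modal attention (standing in for MPCATs). All layer choices, dimensions, and the text-centric pairing are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn


class TwoStageFusionSketch(nn.Module):
    """Hypothetical sketch of a two-stage multimodal fusion pipeline."""

    def __init__(self, dim: int = 128, heads: int = 4, num_classes: int = 6):
        super().__init__()
        # Stage 1 (contextual modeling): one attention module per modality,
        # a stand-in for a multi-head directed graph attention network.
        self.intra = nn.ModuleDict({
            m: nn.MultiheadAttention(dim, heads, batch_first=True)
            for m in ("text", "audio", "visual")
        })
        # Stage 2 (cross-modal modeling): pairwise cross-attention; here the
        # text stream queries the audio and visual streams as an example pairing.
        self.cross_ta = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_tv = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(3 * dim, num_classes)

    def forward(self, feats: dict) -> torch.Tensor:
        # feats maps modality name -> (batch, num_utterances, dim) features.
        # Intra-modal context: each modality attends over its own utterances.
        ctx = {m: self.intra[m](x, x, x)[0] for m, x in feats.items()}
        # Inter-modal complement: text queries audio and visual contexts.
        t2a, _ = self.cross_ta(ctx["text"], ctx["audio"], ctx["audio"])
        t2v, _ = self.cross_tv(ctx["text"], ctx["visual"], ctx["visual"])
        fused = torch.cat([ctx["text"], t2a, t2v], dim=-1)
        return self.classifier(fused)  # per-utterance emotion logits


if __name__ == "__main__":
    model = TwoStageFusionSketch()
    batch = {m: torch.randn(2, 10, 128) for m in ("text", "audio", "visual")}
    print(model(batch).shape)  # torch.Size([2, 10, 6])
```

The design choice illustrated here is the separation of concerns: stage one only looks within a modality (local and long-range context), while stage two only exchanges information across modalities, so complementary cues are injected after each modality's representation has already been contextualized.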