Multimodal Emotion Recognition in Conversation (ERC) plays an important role in human-computer interaction and conversational robotics, as it enables machines to provide empathetic services. Multimodal data modeling has become an active research area in recent years, inspired by the human ability to integrate multiple senses. Several graph-based approaches claim to capture interactive information between modalities, but the heterogeneity of multimodal data prevents these methods from reaching optimal solutions. In this work, we introduce a multimodal fusion approach named Graph and Attention based Two-stage Multi-source Information Fusion (GA2MIF) for emotion detection in conversation. Our proposed method circumvents the problem of taking heterogeneous graphs as input to the model. GA2MIF performs contextual modeling and cross-modal modeling by leveraging Multi-head Directed Graph ATtention networks (MDGATs) and Multi-head Pairwise Cross-modal ATtention networks (MPCATs), respectively. Extensive experiments on two public datasets (IEMOCAP and MELD) demonstrate that the proposed GA2MIF effectively captures intra-modal long-range contextual information and inter-modal complementary information, and outperforms prevailing State-Of-The-Art (SOTA) models by a remarkable margin.