Multimodal Emotion Recognition in Conversation (ERC) plays an influential role in human-computer interaction and conversational robotics, since it enables machines to provide empathetic services. Multimodal data modeling has become a burgeoning research area in recent years, inspired by the human capability to integrate multiple senses. Several graph-based approaches claim to capture interactive information between modalities, but the heterogeneity of multimodal data prevents these methods from reaching optimal solutions. In this work, we introduce a multimodal fusion approach named Graph and Attention-based Two-stage Multi-source Information Fusion (GA2MIF) for emotion detection in conversation. Our proposed method circumvents the problem of taking heterogeneous graphs as input to the model. GA2MIF performs contextual modeling and cross-modal modeling by leveraging Multi-head Directed Graph ATtention networks (MDGATs) and Multi-head Pairwise Cross-modal ATtention networks (MPCATs), respectively. Extensive experiments on two public datasets (i.e., IEMOCAP and MELD) demonstrate that the proposed GA2MIF validly captures intra-modal long-range contextual information and inter-modal complementary information, and outperforms prevalent State-Of-The-Art (SOTA) models by a remarkable margin.
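To make the two-stage idea concrete, the following is a minimal, illustrative PyTorch sketch of this kind of fusion pipeline, not the authors' implementation: stage one approximates directed graph attention over each modality's utterance sequence with window-masked multi-head self-attention, and stage two fuses modalities with pairwise cross-modal attention. All class names, window sizes, and the summation-based fusion are assumptions introduced purely for illustration.

```python
# Illustrative two-stage fusion sketch in the spirit of GA2MIF (NOT the authors' code).
# Stage 1: intra-modal contextual modeling via window-masked multi-head self-attention,
#          a simplified stand-in for the paper's directed graph attention (MDGAT).
# Stage 2: pairwise cross-modal attention between modalities, a stand-in for MPCAT.
import torch
import torch.nn as nn


def directed_window_mask(n: int, past: int, future: int) -> torch.Tensor:
    """Boolean attention mask (True = blocked) letting each utterance attend only to a
    bounded window of past/future utterances, mimicking a sparse directed conversation graph."""
    idx = torch.arange(n)
    dist = idx[None, :] - idx[:, None]          # positive = future utterance
    allowed = (dist >= -past) & (dist <= future)
    return ~allowed


class ContextStage(nn.Module):
    """Stage 1 (illustrative): intra-modal contextual modeling with windowed self-attention."""
    def __init__(self, dim: int, heads: int = 4, past: int = 8, future: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.past, self.future = past, future

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (batch, utterances, dim)
        mask = directed_window_mask(x.size(1), self.past, self.future).to(x.device)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return self.norm(x + out)                           # residual + layer norm


class CrossModalStage(nn.Module):
    """Stage 2 (illustrative): one modality queries each other modality; results are summed."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_mod: torch.Tensor, other_mods: list) -> torch.Tensor:
        fused = query_mod
        for ctx in other_mods:                               # pairwise cross-modal attention
            out, _ = self.attn(query_mod, ctx, ctx)
            fused = fused + out
        return self.norm(fused)


if __name__ == "__main__":
    B, N, D = 2, 10, 128                                     # batch, utterances, feature dim
    text, audio, video = (torch.randn(B, N, D) for _ in range(3))
    context = ContextStage(D)                                # shared here only for brevity
    cross = CrossModalStage(D)
    t, a, v = context(text), context(audio), context(video)  # stage 1: intra-modal context
    fused_t = cross(t, [a, v])                               # stage 2: text attends to audio/video
    print(fused_t.shape)                                     # torch.Size([2, 10, 128])
```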