Emotions are an inherent part of human interactions, and consequently, it is imperative to develop AI systems that understand and recognize human emotions. In a conversation involving multiple people, a person's emotions are influenced by the other speakers' utterances as well as their own emotional state over the course of the conversation. In this paper, we propose a COntextualized Graph Neural Network based Multimodal Emotion recognitioN (COGMEN) system that leverages local information (i.e., inter/intra dependencies between speakers) and global information (context). The proposed model uses a Graph Neural Network (GNN) based architecture to model these complex dependencies (local and global information) in a conversation. Our model gives state-of-the-art (SOTA) results on the IEMOCAP and MOSEI datasets, and detailed ablation experiments show the importance of modeling information at both levels.
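To make the modeling idea concrete, the following is a minimal sketch, assuming PyTorch and PyTorch Geometric, of how a conversation can be cast as a relational graph: edges within a context window carry intra-speaker vs. inter-speaker relation types (the local information), while a transformer encoder over the utterance sequence supplies global context before the graph layers. The helper `build_conversation_graph`, the class `CogmenSketch`, the window size, and all dimensions are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch of a COGMEN-style pipeline; layer choices and
# hyperparameters are assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn
from torch_geometric.nn import RGCNConv, TransformerConv


def build_conversation_graph(num_utts, speakers, window=2):
    """Connect each utterance to neighbours within `window` past/future turns.
    Relation 0 = same speaker (intra), 1 = different speaker (inter):
    the 'local information' described in the abstract."""
    src, dst, rel = [], [], []
    for i in range(num_utts):
        for j in range(max(0, i - window), min(num_utts, i + window + 1)):
            if i == j:
                continue
            src.append(i)
            dst.append(j)
            rel.append(0 if speakers[i] == speakers[j] else 1)
    return torch.tensor([src, dst]), torch.tensor(rel)


class CogmenSketch(nn.Module):
    def __init__(self, feat_dim, hidden, num_classes, num_relations=2):
        super().__init__()
        # Global context: a transformer encoder over the utterance sequence.
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4,
                                           batch_first=True)
        self.context = nn.TransformerEncoder(layer, num_layers=2)
        # Local dependencies: relation-aware graph convolution, then a
        # graph-transformer layer, then per-utterance emotion classification.
        self.rgcn = RGCNConv(feat_dim, hidden, num_relations)
        self.graph_attn = TransformerConv(hidden, hidden, heads=1)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, utt_feats, speakers):
        # utt_feats: (num_utterances, feat_dim) fused multimodal features.
        h = self.context(utt_feats.unsqueeze(0)).squeeze(0)
        edge_index, edge_type = build_conversation_graph(len(speakers), speakers)
        h = torch.relu(self.rgcn(h, edge_index, edge_type))
        h = torch.relu(self.graph_attn(h, edge_index))
        return self.classifier(h)  # per-utterance emotion logits


# Example: a 5-utterance dialogue between two speakers.
model = CogmenSketch(feat_dim=128, hidden=64, num_classes=6)
logits = model(torch.randn(5, 128), speakers=[0, 1, 0, 1, 0])
print(logits.shape)  # torch.Size([5, 6])
```

Typed edges let the graph layers treat a speaker's self-dependency differently from cross-speaker influence, which is the point of separating local from global information in the abstract.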