Automatic emotion recognition (AER) based on enriched multimodal inputs, including text, speech, and visual cues, is crucial to the development of emotionally intelligent machines. Although complex inter-modality relationships have proven effective for AER, they remain largely underexplored because previous works predominantly relied on various fusion mechanisms that simply concatenate features to learn multimodal representations for emotion classification. This paper proposes a novel hierarchical fusion graph convolutional network (HFGCN) that learns more informative multimodal representations by considering modality dependencies during the feature fusion procedure. Specifically, the proposed model fuses the multimodal inputs through a two-stage graph construction approach and encodes the modality dependencies into the conversation representation. We verified the interpretability of the proposed method by projecting the emotional states onto a 2D valence-arousal (VA) subspace. Extensive experiments demonstrated the effectiveness of the proposed model for more accurate AER, yielding state-of-the-art results on two public datasets, IEMOCAP and MELD.
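To make the two-stage idea concrete, the sketch below shows one plausible reading of hierarchical graph fusion: a first graph connects the text, speech, and visual nodes of each utterance (cross-modal edges), and a second graph connects the fused utterance nodes across the conversation. This is a minimal illustration, not the authors' implementation; the class names, feature dimensions, pooling choice, and fully connected adjacency matrices are all assumptions made for the example.

```python
# Minimal sketch (not the authors' code) of two-stage graph fusion with a GCN.
# All dimensions, class names, and graph layouts here are illustrative assumptions.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph convolution: H' = ReLU(D^{-1/2}(A+I)D^{-1/2} H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, h, adj):
        a = adj + torch.eye(adj.size(0))            # add self-loops
        d = a.sum(dim=1).clamp(min=1e-6).rsqrt()    # D^{-1/2}
        a_norm = d.unsqueeze(1) * a * d.unsqueeze(0)
        return torch.relu(self.proj(a_norm @ h))

class TwoStageFusionGCN(nn.Module):
    """Stage 1: fuse text/speech/visual nodes of each utterance via cross-modal edges.
    Stage 2: propagate fused utterance nodes over the conversation graph."""
    def __init__(self, dim, n_classes):
        super().__init__()
        self.gcn1 = GCNLayer(dim, dim)   # modality-level fusion
        self.gcn2 = GCNLayer(dim, dim)   # conversation-level fusion
        self.cls = nn.Linear(dim, n_classes)

    def forward(self, feats, intra_adj, inter_adj):
        # feats: (3 * n_utt, dim) stacked [text; speech; visual] node features
        h = self.gcn1(feats, intra_adj)              # stage 1: cross-modal edges
        n_utt = feats.size(0) // 3
        utt = h.view(3, n_utt, -1).mean(dim=0)       # pool modalities per utterance
        utt = self.gcn2(utt, inter_adj)              # stage 2: conversation graph
        return self.cls(utt)                         # per-utterance emotion logits

# Toy usage: 4 utterances, 128-dim features, 6 emotion classes.
n_utt, dim = 4, 128
feats = torch.randn(3 * n_utt, dim)
intra_adj = torch.zeros(3 * n_utt, 3 * n_utt)
for i in range(n_utt):                               # connect the 3 modality nodes of utterance i
    idx = [i, i + n_utt, i + 2 * n_utt]
    for a in idx:
        for b in idx:
            intra_adj[a, b] = 1.0
inter_adj = torch.ones(n_utt, n_utt)                 # fully connected conversation graph
model = TwoStageFusionGCN(dim, n_classes=6)
logits = model(feats, intra_adj, inter_adj)          # shape: (4, 6)
```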