Emotion recognition is a challenging and actively studied research area that plays a critical role in emotion-aware human-computer interaction systems. In multimodal settings, temporal alignment between different modalities has not yet been well investigated. This paper presents a new model, the Gated Bidirectional Alignment Network (GBAN), which consists of an attention-based bidirectional alignment network over LSTM hidden states that explicitly captures the alignment relationship between speech and text, and a novel group gated fusion (GGF) layer that integrates the representations of the different modalities. We empirically show that the attention-aligned representations significantly outperform the last hidden states of the LSTMs, and that the proposed GBAN model outperforms existing state-of-the-art multimodal approaches on the IEMOCAP dataset.
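To make the two components named in the abstract concrete, the sketch below illustrates (a) aligning one modality's sequence to another via dot-product attention over hidden states, and (b) fusing the resulting modality representations with sigmoid gates. This is a minimal, hypothetical NumPy sketch of the general idea, not the paper's exact GBAN/GGF formulation: the function names, the choice of dot-product scoring, and the per-modality gate matrices are all assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_align(queries, keys):
    """Align each query-side step to the key-side sequence.

    queries: (Tq, d) hidden states of one modality (e.g. text LSTM).
    keys:    (Tk, d) hidden states of the other modality (e.g. speech LSTM).
    Returns a (Tq, d) sequence: for each query step, an attention-weighted
    combination of the key-side states (simplified dot-product scoring).
    """
    scores = queries @ keys.T            # (Tq, Tk) alignment scores
    weights = softmax(scores, axis=-1)   # attention over key-side steps
    return weights @ keys                # (Tq, d) aligned representation

def group_gated_fusion(reps, gate_weights):
    """Gate and sum per-modality vectors (hypothetical fusion sketch).

    reps:         list of (d,) modality representation vectors.
    gate_weights: list of (d, d) gate matrices, one per modality.
    Each modality is scaled elementwise by a sigmoid gate computed from
    its own representation, then the gated vectors are summed.
    """
    gates = [1.0 / (1.0 + np.exp(-(W @ r))) for W, r in zip(gate_weights, reps)]
    return sum(g * r for g, r in zip(gates, reps))
```

Running bidirectional alignment would apply `attention_align` in both directions (text-to-speech and speech-to-text); the gated sum lets the model learn, per dimension, how much each modality contributes to the fused representation.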