Emotion recognition in conversations is challenging due to the multi-modal nature of emotion expression. We propose a hierarchical cross-attention model (HCAM) for multi-modal emotion recognition that combines recurrent and co-attention neural network layers. The input to the model consists of two modalities: i) audio data, processed through a learnable wav2vec approach, and ii) text data, represented using a bidirectional encoder representations from transformers (BERT) model. The audio and text representations are processed by a set of bi-directional recurrent neural network layers with self-attention that converts each utterance in a given conversation to a fixed-dimensional embedding. To incorporate contextual knowledge and information across the two modalities, the audio and text embeddings are combined using a co-attention layer that weighs the utterance-level embeddings according to their relevance to the emotion recognition task. The neural network parameters in the audio layers, the text layers, and the multi-modal co-attention layers are trained hierarchically for the emotion classification task. We perform experiments on three established datasets, namely IEMOCAP, MELD, and CMU-MOSI, and show that the proposed model improves significantly over other benchmarks and achieves state-of-the-art results on all these datasets.
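The sketch below illustrates the overall structure described above (per-modality bi-directional recurrent encoders with self-attention pooling, followed by a co-attention fusion of the utterance-level embeddings) in PyTorch. The layer sizes, attention formulation, and module names are assumptions made for illustration only; they are not the authors' exact architecture or training schedule.

import torch
import torch.nn as nn

class UtteranceEncoder(nn.Module):
    """Bi-directional recurrent encoder with self-attention pooling that maps a
    sequence of frame/token features to a fixed-dimensional utterance embedding."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hid_dim, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hid_dim, 1)  # self-attention scores over time steps

    def forward(self, x):                      # x: (batch, seq_len, in_dim)
        h, _ = self.rnn(x)                     # (batch, seq_len, 2*hid_dim)
        w = torch.softmax(self.attn(h), dim=1) # attention weights over the sequence
        return (w * h).sum(dim=1)              # (batch, 2*hid_dim)

class CoAttentionFusion(nn.Module):
    """Cross-attends audio and text utterance embeddings from a conversation and
    classifies emotions; a simple cross-attention variant assumed for illustration."""
    def __init__(self, dim, n_classes):
        super().__init__()
        self.audio_to_text = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.text_to_audio = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(2 * dim, n_classes)

    def forward(self, audio_emb, text_emb):    # each: (batch, n_utterances, dim)
        a2t, _ = self.audio_to_text(audio_emb, text_emb, text_emb)
        t2a, _ = self.text_to_audio(text_emb, audio_emb, audio_emb)
        fused = torch.cat([a2t, t2a], dim=-1)  # combine both attention directions
        return self.classifier(fused)          # (batch, n_utterances, n_classes)

In a hierarchical training setup of this kind, the audio and text encoders would typically be trained first on the emotion labels and then frozen or fine-tuned while the co-attention fusion layer is trained on their utterance embeddings.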