Accurately detecting emotions in conversation is a necessary yet challenging task due to the complexity of emotions and the dynamics of dialogue. The emotional state of a speaker can be influenced by many factors, such as interlocutor stimulus, dialogue scene, and topic. In this work, we propose a conversational speech emotion recognition method that captures attentive contextual dependencies and speaker-sensitive interactions. First, we use a pretrained VGGish model to extract segment-based audio representations of individual utterances. Second, an attentive bi-directional gated recurrent unit (GRU) models context-sensitive information and jointly explores intra- and inter-speaker dependencies in a dynamic manner. Experiments conducted on the standard conversational dataset MELD demonstrate the effectiveness of the proposed method compared against state-of-the-art methods.
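As a rough illustration (not the authors' released code), the second stage can be sketched in PyTorch under the following assumptions: each utterance has already been reduced to a single 128-dimensional vector (e.g., by mean-pooling VGGish segment embeddings), and a bidirectional GRU with additive attention then scores each utterance against its dialogue context. The class name `AttentiveBiGRU`, the pooling choice, and the attention formulation are illustrative; the speaker-sensitive interaction component is omitted for brevity. The output dimension of 7 matches MELD's seven emotion labels.

```python
import torch
import torch.nn as nn

class AttentiveBiGRU(nn.Module):
    """Sketch: bi-directional GRU over a dialogue's utterance embeddings,
    followed by additive attention that weights contextual utterances."""
    def __init__(self, in_dim=128, hid_dim=128, n_classes=7):
        super().__init__()
        self.gru = nn.GRU(in_dim, hid_dim, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hid_dim, 1)           # additive attention scorer
        self.classifier = nn.Linear(4 * hid_dim, n_classes)

    def forward(self, x):
        # x: (batch, n_utterances, in_dim) -- one pooled VGGish vector per utterance
        h, _ = self.gru(x)                               # (batch, T, 2*hid_dim)
        scores = self.attn(h)                            # (batch, T, 1)
        alpha = torch.softmax(scores, dim=1)             # attention over dialogue context
        context = (alpha * h).sum(dim=1, keepdim=True)   # (batch, 1, 2*hid_dim)
        context = context.expand(-1, h.size(1), -1)      # broadcast context to each step
        # classify each utterance from its own hidden state plus attended context
        return self.classifier(torch.cat([h, context], dim=-1))  # (batch, T, n_classes)

# Example: 2 dialogues of 12 utterances each, 128-d features per utterance.
model = AttentiveBiGRU()
logits = model(torch.randn(2, 12, 128))                  # (2, 12, 7)
```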