Multimodal speech emotion recognition aims to detect speakers' emotions from audio and text. Prior works mainly focus on exploiting advanced networks to model and fuse information from different modalities to improve performance, while neglecting the effect that different fusion strategies have on emotion recognition. In this work, we study a simple yet important question: which way of fusing audio and text information is most helpful for this multimodal task. We further propose a multimodal emotion recognition model improved by a perspective loss. Empirical results show that our method achieves new state-of-the-art results on the IEMOCAP dataset. An in-depth analysis explains why the improved model outperforms the baselines.
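For readers unfamiliar with what is meant by a "fusion strategy", the following is a minimal PyTorch sketch contrasting two common generic options, feature-level (early) and decision-level (late) fusion of audio and text embeddings. It is purely illustrative: the module names, dimensions, and classifier heads are hypothetical and do not describe the architecture proposed in this work.

# Minimal sketch of two generic fusion strategies for audio/text emotion
# recognition. All names and dimensions are hypothetical, not the paper's model.
import torch
import torch.nn as nn


class EarlyFusionClassifier(nn.Module):
    """Feature-level fusion: concatenate modality embeddings, then classify."""

    def __init__(self, audio_dim=128, text_dim=768, num_classes=4):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(audio_dim + text_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, audio_feat, text_feat):
        # Concatenate pooled audio and text features along the last dimension.
        fused = torch.cat([audio_feat, text_feat], dim=-1)
        return self.classifier(fused)


class LateFusionClassifier(nn.Module):
    """Decision-level fusion: classify each modality separately, then average logits."""

    def __init__(self, audio_dim=128, text_dim=768, num_classes=4):
        super().__init__()
        self.audio_head = nn.Linear(audio_dim, num_classes)
        self.text_head = nn.Linear(text_dim, num_classes)

    def forward(self, audio_feat, text_feat):
        # Average the per-modality emotion logits.
        return 0.5 * (self.audio_head(audio_feat) + self.text_head(text_feat))


if __name__ == "__main__":
    audio = torch.randn(8, 128)   # batch of pooled audio embeddings
    text = torch.randn(8, 768)    # batch of pooled text embeddings
    print(EarlyFusionClassifier()(audio, text).shape)  # torch.Size([8, 4])
    print(LateFusionClassifier()(audio, text).shape)   # torch.Size([8, 4])

The point of the comparison studied in this work is that such choices of where and how the modalities are combined can themselves affect recognition accuracy, independently of the strength of the underlying encoders.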