Emotion Recognition (ER) aims to classify human utterances into different emotion categories. In this paper, we propose MMER, a multimodal multi-task learning approach for ER from individual utterances in isolation, built on early fusion and self-attention-based multimodal interaction between the text and acoustic modalities. MMER leverages a multimodal dynamic fusion network that adds minimal parameters over an existing speech encoder to exploit the semantic and syntactic properties hidden in text. Experiments on the IEMOCAP benchmark show that our proposed model achieves state-of-the-art performance. In addition, comparisons against strong baselines and ablation studies demonstrate the effectiveness of our approach. We make our code publicly available on GitHub.
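
As a rough illustration of the fusion scheme described above (a minimal sketch, not the authors' exact architecture), the following PyTorch snippet projects text and acoustic token embeddings into a shared space, concatenates them (early fusion), and applies self-attention over the joint sequence before utterance-level classification; all dimensions and the 4-class output head are illustrative assumptions.

    import torch
    import torch.nn as nn

    class CrossModalFusion(nn.Module):
        """Sketch: early fusion + self-attention over concatenated
        text and acoustic token embeddings (illustrative only)."""
        def __init__(self, text_dim=768, audio_dim=768, hidden_dim=256, num_heads=4):
            super().__init__()
            # Project each modality into a shared hidden space before fusion.
            self.text_proj = nn.Linear(text_dim, hidden_dim)
            self.audio_proj = nn.Linear(audio_dim, hidden_dim)
            self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
            self.classifier = nn.Linear(hidden_dim, 4)  # e.g. 4 IEMOCAP emotion classes (assumed)

        def forward(self, text_emb, audio_emb):
            # text_emb: (B, T_text, text_dim); audio_emb: (B, T_audio, audio_dim)
            fused = torch.cat([self.text_proj(text_emb),
                               self.audio_proj(audio_emb)], dim=1)  # early fusion along time axis
            attended, _ = self.attn(fused, fused, fused)            # self-attention over both modalities
            pooled = attended.mean(dim=1)                           # utterance-level pooling
            return self.classifier(pooled)

    # Usage sketch with random embeddings standing in for encoder outputs:
    # logits = CrossModalFusion()(torch.randn(2, 20, 768), torch.randn(2, 50, 768))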