Emotion Recognition (ER) aims to classify human utterances into different emotion categories. In this paper, we propose a multimodal multitask learning approach for ER from individual utterances in isolation, based on early fusion and self-attention-based multimodal interaction between the text and acoustic modalities. Experiments on the IEMOCAP benchmark show that our proposed model outperforms our re-implementation of the state of the art, as well as all other unimodal and multimodal approaches in the literature. In addition, strong baselines and ablation studies demonstrate the effectiveness of our approach. We make all our code publicly available on GitHub.
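To make the early-fusion and self-attention components concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: text and acoustic features are projected into a shared space, concatenated along the time axis (early fusion), passed through a Transformer self-attention encoder, and fed to two task heads for multitask learning. All module names, feature dimensions, and the auxiliary task are illustrative assumptions.

```python
# Hedged sketch of early fusion + self-attention multimodal interaction.
# Dimensions, pooling, and the auxiliary head are assumptions, not the paper's spec.
import torch
import torch.nn as nn


class EarlyFusionER(nn.Module):
    def __init__(self, text_dim=768, audio_dim=128, d_model=256,
                 n_heads=4, n_layers=2, n_emotions=4, n_aux=2):
        super().__init__()
        # Project both modalities into a shared space, then fuse early
        # (before any task-specific processing) by concatenating along time.
        self.text_proj = nn.Linear(text_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Multitask heads: primary emotion classifier plus a hypothetical
        # auxiliary head (the actual auxiliary task is defined in the paper).
        self.emotion_head = nn.Linear(d_model, n_emotions)
        self.aux_head = nn.Linear(d_model, n_aux)

    def forward(self, text_feats, audio_feats):
        # text_feats: (B, T_text, text_dim); audio_feats: (B, T_audio, audio_dim)
        fused = torch.cat(
            [self.text_proj(text_feats), self.audio_proj(audio_feats)], dim=1)
        hidden = self.encoder(fused)   # self-attention over the fused sequence
        pooled = hidden.mean(dim=1)    # simple mean pooling over time
        return self.emotion_head(pooled), self.aux_head(pooled)


# Example forward pass on dummy utterance-level features.
model = EarlyFusionER()
text = torch.randn(2, 20, 768)
audio = torch.randn(2, 50, 128)
emotion_logits, aux_logits = model(text, audio)
```

The key design choice illustrated here is that fusion happens at the input to the encoder, so self-attention can attend across text and acoustic positions jointly rather than combining modality-specific decisions late.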