Multimodal emotion recognition has attracted much attention recently. Effectively fusing multiple modalities with limited labeled data remains a challenging task. Given the success of pre-trained models and the fine-grained nature of emotion expression, it is reasonable to take both aspects into account. Unlike previous methods that mainly focus on one of these aspects, we introduce a novel multi-granularity framework that combines fine-grained representations with pre-trained utterance-level representations. Inspired by Transformer TTS, we propose a multilevel transformer model to perform fine-grained multimodal emotion recognition. Specifically, we explore different methods of incorporating phoneme-level embeddings with word-level embeddings. To perform multi-granularity learning, we simply combine the multilevel transformer model with ALBERT. Extensive experimental results show that both our multilevel transformer model and our multi-granularity model outperform previous state-of-the-art approaches on the IEMOCAP dataset using text transcripts and speech signals.
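To make the multi-granularity idea concrete, the sketch below shows one plausible way to combine token-aligned phoneme-level and word-level embeddings in a transformer encoder and then fuse the result with a pre-trained utterance-level vector (e.g., from ALBERT). This is a minimal illustrative assumption, not the authors' implementation: all module names, dimensions, the additive token combination, and the concatenation-based fusion are hypothetical choices.

```python
# Minimal sketch (assumptions, not the paper's code): fuse fine-grained
# phoneme/word embeddings via a transformer encoder, then concatenate a
# pre-trained utterance-level vector before emotion classification.
import torch
import torch.nn as nn


class MultiGranularityClassifier(nn.Module):
    def __init__(self, d_model=256, utt_dim=768, n_heads=4,
                 n_layers=2, n_classes=4):
        super().__init__()
        # Project phoneme- and word-level features into a shared space,
        # then sum them token-wise (one of several possible combinations
        # the abstract alludes to).
        self.phoneme_proj = nn.Linear(d_model, d_model)
        self.word_proj = nn.Linear(d_model, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Fuse the pooled fine-grained representation with the pre-trained
        # utterance-level vector (assumed here to come from ALBERT).
        self.classifier = nn.Linear(d_model + utt_dim, n_classes)

    def forward(self, phoneme_emb, word_emb, utterance_emb):
        # phoneme_emb, word_emb: (batch, seq_len, d_model), aligned per token
        # utterance_emb: (batch, utt_dim), e.g. an ALBERT [CLS] output
        tokens = self.phoneme_proj(phoneme_emb) + self.word_proj(word_emb)
        encoded = self.encoder(tokens)        # fine-grained contextualization
        pooled = encoded.mean(dim=1)          # mean-pool over the sequence
        fused = torch.cat([pooled, utterance_emb], dim=-1)
        return self.classifier(fused)         # emotion logits


if __name__ == "__main__":
    model = MultiGranularityClassifier()
    logits = model(torch.randn(2, 10, 256),   # phoneme-level embeddings
                   torch.randn(2, 10, 256),   # word-level embeddings
                   torch.randn(2, 768))       # utterance-level vector
    print(logits.shape)  # torch.Size([2, 4])
```

Other fusion variants (cross-attention between granularities, or gating instead of concatenation) fit the same interface; the paper's own combination strategies are among the design choices it compares.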