Research on and applications of multimodal emotion recognition have grown rapidly in recent years. However, multimodal emotion recognition suffers from a scarcity of labeled data. To address this problem, we propose to use transfer learning, leveraging state-of-the-art pre-trained models, namely wav2vec 2.0 and BERT, for this task. We explore multi-level fusion approaches, including co-attention-based early fusion and late fusion over models trained on each set of embeddings. In addition, we propose a multi-granularity framework that extracts not only frame-level speech embeddings but also segment-level embeddings at the phone, syllable, and word levels to further boost performance. By combining our co-attention-based early fusion model and late fusion model with the multi-granularity feature extraction framework, we obtain results that outperform the best baseline approaches by 1.3% unweighted accuracy (UA) on the IEMOCAP dataset.
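To make the fusion concrete, below is a minimal PyTorch sketch of what co-attention-based early fusion over pre-extracted wav2vec 2.0 (speech) and BERT (text) embeddings could look like. The hidden size, number of attention heads, mean pooling, and classifier head are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch (PyTorch) of co-attention-based early fusion over
# wav2vec 2.0 frame embeddings and BERT token embeddings.
# All dimensions and the classifier head are illustrative assumptions.
import torch
import torch.nn as nn

class CoAttentionFusion(nn.Module):
    def __init__(self, speech_dim=768, text_dim=768, hidden_dim=256,
                 num_heads=4, num_classes=4):
        super().__init__()
        # Project both modalities into a shared space.
        self.speech_proj = nn.Linear(speech_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Cross-attention in both directions: speech attends to text,
        # and text attends to speech (co-attention).
        self.s2t_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.t2s_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, speech_emb, text_emb):
        # speech_emb: (B, T_s, speech_dim) frame-level wav2vec 2.0 features
        # text_emb:   (B, T_t, text_dim)   token-level BERT features
        s = self.speech_proj(speech_emb)
        t = self.text_proj(text_emb)
        # Each modality queries the other.
        s_att, _ = self.s2t_attn(query=s, key=t, value=t)
        t_att, _ = self.t2s_attn(query=t, key=s, value=s)
        # Mean-pool over time and fuse the two views by concatenation.
        fused = torch.cat([s_att.mean(dim=1), t_att.mean(dim=1)], dim=-1)
        return self.classifier(fused)

if __name__ == "__main__":
    model = CoAttentionFusion()
    speech = torch.randn(2, 300, 768)  # e.g. wav2vec 2.0 frame embeddings
    text = torch.randn(2, 40, 768)     # e.g. BERT token embeddings
    print(model(speech, text).shape)   # torch.Size([2, 4])
```

A late fusion variant would instead train separate classifiers on each embedding stream and combine their output scores, e.g. by averaging logits.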
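The multi-granularity idea can likewise be sketched as pooling frame-level embeddings within segment boundaries. The sketch below assumes a 20 ms frame step and phone/syllable/word time spans from a forced aligner; both the frame step and the alignment format are assumptions, not details given in the abstract.

```python
# Minimal sketch of segment-level pooling for a multi-granularity
# framework: frame-level speech embeddings are averaged within
# phone/syllable/word boundaries. Frame step and span format assumed.
import torch

def pool_segments(frame_emb, boundaries, frame_step=0.02):
    """frame_emb: (T, D) frame-level embeddings.
    boundaries: list of (start_sec, end_sec) spans for one granularity.
    Returns (num_segments, D) segment-level embeddings."""
    segs = []
    for start, end in boundaries:
        i = int(start / frame_step)
        j = max(int(end / frame_step), i + 1)  # at least one frame per segment
        segs.append(frame_emb[i:j].mean(dim=0))
    return torch.stack(segs)

frames = torch.randn(300, 768)                  # ~6 s of frame embeddings
word_spans = [(0.0, 0.41), (0.41, 0.93)]        # hypothetical word alignment
print(pool_segments(frames, word_spans).shape)  # torch.Size([2, 768])
```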