The lack of data and the difficulty of multimodal fusion have always been challenges for multimodal emotion recognition (MER). In this paper, we propose to use pretrained models as upstream networks, wav2vec 2.0 for the audio modality and BERT for the text modality, and fine-tune them on the downstream MER task to cope with the lack of data. To address the difficulty of multimodal fusion, we use a K-layer multi-head attention mechanism as the downstream fusion module. Starting from the MER task itself, we design two auxiliary tasks to alleviate the insufficient fusion between modalities and to guide the network to capture and align emotion-related features. Compared to previous state-of-the-art models, we achieve better performance, with 78.42% weighted accuracy (WA) and 79.71% unweighted accuracy (UA) on the IEMOCAP dataset.
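The abstract names the main architectural components (wav2vec 2.0 and BERT as upstream encoders, a K-layer multi-head attention fusion module). The following is a minimal, hypothetical PyTorch sketch of how such a pipeline might be wired; the checkpoint names, the value of K, the head count, the cross-attention direction (text queries attending to audio), and the pooling/classifier head are all assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, BertModel

class FusionMER(nn.Module):
    """Hypothetical sketch: pretrained wav2vec 2.0 + BERT upstream encoders
    followed by K stacked multi-head cross-attention fusion layers."""
    def __init__(self, num_classes=4, k_layers=4, n_heads=8, d_model=768):
        super().__init__()
        # Upstream pretrained encoders (checkpoint names assumed)
        self.audio_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        # K-layer multi-head attention fusion: text features attend to audio features
        self.fusion = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True) for _ in range(k_layers)]
        )
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, audio_input_values, text_input_ids, text_attention_mask):
        a = self.audio_encoder(audio_input_values).last_hidden_state           # (B, Ta, 768)
        t = self.text_encoder(text_input_ids,
                              attention_mask=text_attention_mask).last_hidden_state  # (B, Tt, 768)
        fused = t
        for attn in self.fusion:
            fused, _ = attn(query=fused, key=a, value=a)                        # cross-modal attention
        return self.classifier(fused.mean(dim=1))                               # pooled emotion logits
```

In this sketch both encoders would be fine-tuned jointly with the fusion module on the MER objective; the two auxiliary tasks mentioned in the abstract would add extra loss terms on top of the classification loss, and are omitted here.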