Recently, self-supervised pre-training has shown significant improvements in many areas of machine learning, including speech and NLP. We propose using large self-supervised pre-trained models for both the audio and text modalities, combined with cross-modal attention, for multimodal emotion recognition. We use Wav2Vec2.0 [1] as the audio encoder base for robust speech feature extraction and the BERT model [2] as the text encoder base for better contextual representation of text. These high-capacity models, trained on large amounts of unlabeled data, contain rich feature representations and improve downstream task performance. We use the cross-modal attention mechanism [3] to learn alignment between the audio and text representations produced by the self-supervised models; cross-modal attention also helps extract interactive information between audio and text features. We obtain utterance-level feature representations from frame-level features using statistics pooling for both modalities and combine them with an early fusion technique. Our experiments show that the proposed approach obtains a 1.88% absolute improvement in accuracy over the previous state-of-the-art method [3] on the IEMOCAP dataset [35]. We also conduct unimodal experiments for both the audio and text modalities and compare them with previous best methods.
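A minimal PyTorch sketch of the fusion pipeline described above. The module names, dimensions, single attention block, and classifier head are illustrative assumptions, not the authors' exact architecture; Wav2Vec2.0 and BERT are assumed to supply frame-level features of shape (batch, time, dim).

```python
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Hypothetical sketch: cross-modal attention + statistics pooling + early fusion."""

    def __init__(self, dim=768, num_heads=8, num_classes=4):
        super().__init__()
        # Cross-modal attention: each modality attends to the other.
        self.audio_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Early fusion of pooled utterance-level features (mean + std per modality).
        self.classifier = nn.Linear(4 * dim, num_classes)

    @staticmethod
    def stats_pool(x):
        # Statistics pooling: frame-level (B, T, D) -> utterance-level (B, 2D).
        return torch.cat([x.mean(dim=1), x.std(dim=1)], dim=-1)

    def forward(self, audio_feats, text_feats):
        # audio_feats: (B, Ta, D) from Wav2Vec2.0; text_feats: (B, Tt, D) from BERT.
        audio_attn, _ = self.audio_to_text(audio_feats, text_feats, text_feats)
        text_attn, _ = self.text_to_audio(text_feats, audio_feats, audio_feats)
        # Pool each attended sequence, concatenate (early fusion), and classify.
        fused = torch.cat(
            [self.stats_pool(audio_attn), self.stats_pool(text_attn)], dim=-1
        )
        return self.classifier(fused)


# Usage with dummy frame-level features.
model = CrossModalFusion()
audio = torch.randn(2, 200, 768)  # e.g. Wav2Vec2.0 frame features
text = torch.randn(2, 30, 768)    # e.g. BERT token features
logits = model(audio, text)       # (2, num_classes)
```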