Using Large Pre-Trained Models with Cross-Modal Attention for Multi-Modal Emotion Recognition

Self-supervised pre-training has recently produced significant improvements in many areas of machine learning, including speech and NLP. We propose using large self-supervised pre-trained models for both the audio and text modalities, combined with cross-modal attention, for multimodal emotion recognition. We use Wav2Vec2.0 [1] as the audio encoder base for robust speech feature extraction and the BERT model [2] as the text encoder base for better contextual representation of text. These high-capacity models, trained on large amounts of unlabeled data, contain rich feature representations and improve performance on the downstream task. We use the cross-modal attention mechanism [3] to learn the alignment between the audio and text representations produced by the self-supervised models. Cross-modal attention also helps extract interactive information between audio and text features. We obtain utterance-level feature representations from frame-level features using statistics pooling for both modalities and combine them using early fusion. Our experiments show that the proposed approach obtains a 1.88% absolute improvement in accuracy over the previous state-of-the-art method [3] on the IEMOCAP dataset [35]. We also conduct unimodal experiments for the audio and text modalities and compare them with the previous best methods.
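The pipeline described in the abstract (cross-modal attention, then statistics pooling, then early fusion) can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation. The function names are hypothetical, the short lists of numbers stand in for real Wav2Vec2.0 and BERT frame-level features, the attention omits the learned query/key/value projections a real model would have, both modalities are assumed to share the same feature dimension, and "early fusion" is assumed here to mean concatenating the two pooled vectors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys_values):
    """Scaled dot-product attention in which each frame of one modality
    (queries) attends over all frames of the other (keys_values).
    Simplification: no learned projections, keys double as values."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[d] for w, v in zip(weights, keys_values)) for d in range(dim)]
        )
    return attended


def statistics_pooling(frames):
    """Collapse frame-level features into one utterance-level vector:
    per-dimension mean followed by per-dimension standard deviation."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    return means + stds


def fuse(audio_frames, text_frames):
    """Attend in both directions, pool each stream, then concatenate
    (the assumed reading of 'early fusion')."""
    audio_attended = cross_modal_attention(audio_frames, text_frames)
    text_attended = cross_modal_attention(text_frames, audio_frames)
    return statistics_pooling(audio_attended) + statistics_pooling(text_attended)


# Stand-in features: 3 audio frames and 2 text tokens, each of dimension 2.
audio = [[0.1, 0.3], [0.2, 0.1], [0.4, 0.0]]
text = [[0.5, 0.2], [0.0, 0.3]]
fused = fuse(audio, text)
print(len(fused))  # 8: (mean + std) * 2 dimensions * 2 modalities
```

In a full system the fused vector would be passed to a classifier that predicts the emotion label; that stage is left out of this sketch.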

Related Content

Attention Mechanism

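The attention mechanism was first proposed in the field of computer vision, but it really took off with the Google DeepMind paper "Recurrent Models of Visual Attention" [14], which applied attention to an RNN model for image classification. Bahdanau et al. then used an attention-like mechanism in "Neural Machine Translation by Jointly Learning to Align and Translate" [1] to perform translation and alignment jointly on a machine translation task; their work is regarded as the first to apply attention to NLP. Attention-based extensions of RNN models were subsequently applied to a wide range of NLP tasks. More recently, how to use attention in CNNs has also become an active research topic. The figure below shows the rough trend of attention research progress.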

[ACM Multimedia 2020] Deep Video Understanding with a Cross-Modal Attention Transformer Model

Zhuanzhi Member Service

75+ reads · August 30, 2020

[ACL 2020, Amazon] Multiresolution and Multimodal Speech Recognition with Transformers

Zhuanzhi Member Service

15+ reads · May 5, 2020

[CVPR 2020, Oxford and Google] Speech to Action: Cross-modal Supervision for Action Recognition

Zhuanzhi Member Service

24+ reads · March 31, 2020

[Peking University] Exploring and Distilling Cross-Modal Information for Image Captioning

Zhuanzhi Member Service

54+ reads · March 3, 2020