Recent Transformer-based contextual word representations, including BERT and XLNet, have shown state-of-the-art performance in multiple disciplines within NLP. Fine-tuning the trained contextual models on task-specific datasets has been the key to achieving superior performance downstream. While fine-tuning these pre-trained models is straightforward for lexical applications (applications with only language modality), it is not trivial for multimodal language (a growing area in NLP focused on modeling face-to-face communication). Pre-trained models do not have the necessary components to accept two extra modalities of vision and acoustic. In this paper, we propose an attachment to BERT and XLNet called Multimodal Adaptation Gate (MAG). MAG allows BERT and XLNet to accept multimodal nonverbal data during fine-tuning. It does so by generating a shift to the internal representation of BERT and XLNet; a shift that is conditioned on the visual and acoustic modalities. In our experiments, we study the commonly used CMU-MOSI and CMU-MOSEI datasets for multimodal sentiment analysis. Fine-tuning MAG-BERT and MAG-XLNet significantly boosts the sentiment analysis performance over previous baselines as well as language-only fine-tuning of BERT and XLNet. On the CMU-MOSI dataset, MAG-XLNet achieves human-level multimodal sentiment analysis performance for the first time in the NLP community.
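To make the shifting mechanism concrete, below is a minimal PyTorch sketch of one plausible MAG layer: gating vectors conditioned on the language representation paired with each nonverbal modality produce a displacement that is scaled and added to the word representation. The dimension names, the `beta_shift` scaling hyperparameter, and the dropout rate are illustrative assumptions, not values prescribed by this abstract.

```python
# A minimal sketch of a Multimodal Adaptation Gate (MAG) layer in PyTorch.
# Hyperparameters (beta_shift, dropout_prob) and dimensions are illustrative
# assumptions; they are not specified in the abstract above.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MAG(nn.Module):
    def __init__(self, text_dim, visual_dim, acoustic_dim,
                 beta_shift=1.0, dropout_prob=0.5):
        super().__init__()
        # Gates conditioned on the language representation concatenated
        # with each nonverbal modality
        self.W_gv = nn.Linear(text_dim + visual_dim, text_dim)
        self.W_ga = nn.Linear(text_dim + acoustic_dim, text_dim)
        # Projections of the nonverbal modalities into the language space
        self.W_v = nn.Linear(visual_dim, text_dim)
        self.W_a = nn.Linear(acoustic_dim, text_dim)
        self.beta_shift = beta_shift
        self.layer_norm = nn.LayerNorm(text_dim)
        self.dropout = nn.Dropout(dropout_prob)

    def forward(self, text, visual, acoustic):
        eps = 1e-6
        # Modality-specific gating vectors
        g_v = F.relu(self.W_gv(torch.cat([text, visual], dim=-1)))
        g_a = F.relu(self.W_ga(torch.cat([text, acoustic], dim=-1)))
        # Nonverbal displacement vector conditioned on vision and acoustics
        h = g_v * self.W_v(visual) + g_a * self.W_a(acoustic)
        # Cap the magnitude of the shift so it cannot dominate the
        # original language representation
        alpha = torch.clamp(
            self.beta_shift * text.norm(dim=-1, keepdim=True)
            / (h.norm(dim=-1, keepdim=True) + eps),
            max=1.0,
        )
        # Shift the internal representation, then renormalize
        return self.dropout(self.layer_norm(text + alpha * h))
```

In this sketch the layer would be attached inside BERT or XLNet so that, during fine-tuning, each word representation is displaced by the gated nonverbal information before flowing into subsequent Transformer layers; the norm-based cap keeps the nonverbal shift proportional to the language vector it modifies.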