As an important task in sentiment analysis, Multimodal Aspect-Based Sentiment Analysis (MABSA) has attracted increasing attention in recent years. However, previous approaches either (i) use separately pre-trained visual and textual models, which ignore cross-modal alignment, or (ii) use vision-language models pre-trained with general pre-training tasks, which are inadequate for identifying fine-grained aspects, opinions, and their alignments across modalities. To tackle these limitations, we propose a task-specific Vision-Language Pre-training framework for MABSA (VLP-MABSA), a unified multimodal encoder-decoder architecture shared by all the pre-training and downstream tasks. We further design three types of task-specific pre-training tasks from the language, vision, and multimodal modalities, respectively. Experimental results show that our approach generally outperforms state-of-the-art approaches on three MABSA subtasks. Further analysis demonstrates the effectiveness of each pre-training task. The source code is publicly released at https://github.com/NUSTM/VLP-MABSA.
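To make the architectural claim concrete, the following is a minimal sketch of a unified multimodal encoder-decoder in PyTorch. It assumes a BART-style Transformer backbone, visual region features of dimension 2048, and a single generative head shared across tasks; all identifiers, dimensions, and the specific task formulation here are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch: one encoder-decoder shared by textual, visual, and
# multimodal pre-training tasks, all cast as sequence generation.
# Assumptions: BART-style backbone, 2048-d region features, toy vocab size.
import torch
import torch.nn as nn

class UnifiedMultimodalEncoderDecoder(nn.Module):
    def __init__(self, vocab_size=50265, d_model=768, region_dim=2048,
                 nhead=12, num_layers=6):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Project visual region features into the token embedding space
        self.region_proj = nn.Linear(region_dim, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        # One LM head: each pre-training / downstream task only changes the
        # target sequence (and any task-indicator tokens) it must generate.
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids, region_feats, decoder_input_ids):
        # Concatenate textual tokens and visual regions as the encoder input
        text = self.token_emb(input_ids)            # (B, Lt, D)
        vision = self.region_proj(region_feats)     # (B, Lv, D)
        enc_in = torch.cat([text, vision], dim=1)   # (B, Lt+Lv, D)
        dec_in = self.token_emb(decoder_input_ids)  # (B, Ld, D)
        hidden = self.transformer(enc_in, dec_in)   # (B, Ld, D)
        return self.lm_head(hidden)                 # (B, Ld, |V|)

# Usage: the same model serves a language, a vision, and a multimodal
# pre-training task by switching the input masking and the target sequence.
model = UnifiedMultimodalEncoderDecoder()
input_ids = torch.randint(0, 50265, (2, 20))        # (masked) text tokens
region_feats = torch.randn(2, 36, 2048)             # 36 detected image regions
decoder_input_ids = torch.randint(0, 50265, (2, 15))
logits = model(input_ids, region_feats, decoder_input_ids)
print(logits.shape)  # torch.Size([2, 15, 50265])
```

Because every task is reduced to generation over a shared vocabulary, no task-specific output layers are needed; only the construction of inputs and targets differs between the pre-training and downstream stages.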