Compared with unimodal data, multimodal data provide more features that help a model analyze sentiment. Previous research rarely considers token-level feature fusion, and few works explore learning the common sentiment-related features in multimodal data to help the model fuse multimodal features. In this paper, we propose a Contrastive Learning and Multi-Layer Fusion (CLMLF) method for multimodal sentiment detection. Specifically, we first encode text and image to obtain hidden representations, and then use a multi-layer fusion module to align and fuse the token-level features of text and image. In addition to the sentiment analysis task, we also design two contrastive learning tasks, label-based contrastive learning and data-based contrastive learning, which help the model learn common sentiment-related features in multimodal data. Extensive experiments conducted on three publicly available multimodal datasets demonstrate the effectiveness of our approach for multimodal sentiment detection compared with existing methods. The code is available at https://github.com/Link-Li/CLMLF
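To make the label-based contrastive objective concrete, the sketch below shows one common formulation of a supervised contrastive loss over fused multimodal representations, where samples sharing a sentiment label are treated as positives. The function name, temperature value, and masking details are illustrative assumptions for exposition, not the exact implementation released in the repository above.

```python
import torch
import torch.nn.functional as F

def label_based_contrastive_loss(features, labels, temperature=0.07):
    """Hedged sketch of a label-based (supervised) contrastive loss.

    features: (batch, dim) fused multimodal representations.
    labels:   (batch,) sentiment labels; same label => positive pair.
    """
    features = F.normalize(features, dim=1)
    sim = features @ features.T / temperature              # pairwise cosine similarities
    not_self = ~torch.eye(len(labels), dtype=torch.bool, device=features.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & not_self

    # log-softmax over all non-self pairs for each anchor
    exp_sim = torch.exp(sim) * not_self
    log_prob = sim - torch.log(exp_sim.sum(dim=1, keepdim=True) + 1e-12)

    # average log-probability of positives per anchor (skip anchors without positives)
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0
    mean_log_prob_pos = (pos_mask * log_prob).sum(dim=1)[valid] / pos_counts[valid]
    return -mean_log_prob_pos.mean()
```

The data-based contrastive task described in the abstract could be expressed in the same form by replacing the label-based positive mask with pairs of augmented views of the same sample.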