Visual-textual sentiment analysis aims to predict sentiment from a paired image and text input. Its main challenge is learning effective visual features for sentiment prediction, since input images are often highly diverse. To address this challenge, we propose a new method that improves visual-textual sentiment analysis by introducing powerful expert visual features. The proposed method consists of four parts: (1) a visual-textual branch that learns features for sentiment analysis directly from the data, (2) a visual expert branch with a set of pre-trained "expert" encoders that extract effective visual features, (3) a CLIP branch that implicitly models visual-textual correspondence, and (4) a multimodal fusion network, based on either BERT or an MLP, that fuses the multimodal features and predicts sentiment. Extensive experiments on three datasets show that our method outperforms existing methods on visual-textual sentiment analysis.
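To make the four-part design concrete, the following minimal PyTorch sketch shows one way the three branch features could be fused by an MLP head for sentiment prediction. The class name, feature dimensions, and the simple concatenate-then-MLP fusion are illustrative assumptions, not the paper's exact implementation.

```python
# A minimal sketch of the four-branch design described above.
# Module names, dimensions, and the concatenation-then-MLP fusion
# are assumptions for illustration, not the authors' implementation.
import torch
import torch.nn as nn

class FourBranchSentimentModel(nn.Module):
    def __init__(self, vt_dim=768, expert_dim=512, clip_dim=512,
                 hidden_dim=512, num_classes=2):
        super().__init__()
        # Inputs to forward() stand in for the three feature branches:
        # (1) visual-textual branch: features learned end-to-end from data
        # (2) visual expert branch: outputs of pre-trained "expert" encoders
        # (3) CLIP branch: embeddings that implicitly encode
        #     visual-textual correspondence
        # (4) MLP fusion head over the concatenated branch features
        self.fusion = nn.Sequential(
            nn.Linear(vt_dim + expert_dim + clip_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, vt_feat, expert_feat, clip_feat):
        fused = torch.cat([vt_feat, expert_feat, clip_feat], dim=-1)
        return self.fusion(fused)  # sentiment logits

# Usage with dummy tensors standing in for the branch encoders' outputs.
model = FourBranchSentimentModel()
logits = model(torch.randn(4, 768), torch.randn(4, 512), torch.randn(4, 512))
```

The paper's BERT-based fusion variant would instead treat the branch features as token embeddings and pass them through a transformer encoder before classification; the MLP head above is the simpler of the two fusion options named in the abstract.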