Computational modeling of the emotions evoked by art in humans is a challenging problem because of the subjective and nuanced nature of both art and affective signals. In this paper, we consider this problem of understanding the emotions that artwork evokes in viewers using both text and visual modalities. Specifically, we frame the analysis of images and the accompanying captions, in which viewers express their emotions, as a multimodal classification task. Our results show that single-stream multimodal transformer-based models such as MMBT and VisualBERT outperform both image-only models and dual-stream multimodal models that have separate pathways for the text and image modalities. We also observe performance improvements on the extreme positive and negative emotion classes when a single-stream model such as MMBT is compared with a text-only transformer model such as BERT.
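To make the single-stream idea concrete, below is a minimal PyTorch sketch (not the paper's actual MMBT or VisualBERT implementation): pooled image features are projected into the same embedding space as the caption tokens, the two are concatenated into one sequence, and a single shared transformer encoder attends over both before classifying the evoked emotion. All class and dimension choices here (e.g., nine emotion classes, 2048-dimensional image features) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SingleStreamMultimodalClassifier(nn.Module):
    """Toy single-stream model: image features become "visual tokens" in the
    same embedding space as text tokens, and one shared transformer encoder
    processes the concatenated sequence (the MMBT/VisualBERT idea, simplified)."""

    def __init__(self, vocab_size=30522, d_model=256, n_heads=4, n_layers=2,
                 img_feat_dim=2048, n_img_tokens=4, n_classes=9):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Project one pooled image feature vector into several visual tokens.
        self.img_proj = nn.Linear(img_feat_dim, n_img_tokens * d_model)
        self.n_img_tokens = n_img_tokens
        self.d_model = d_model
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.cls_head = nn.Linear(d_model, n_classes)

    def forward(self, input_ids, img_feats):
        text = self.tok_emb(input_ids)                        # (B, T, D)
        img = self.img_proj(img_feats)                        # (B, n*D)
        img = img.view(-1, self.n_img_tokens, self.d_model)   # (B, n, D)
        x = torch.cat([img, text], dim=1)                     # single shared stream
        x = self.encoder(x)                                   # joint self-attention
        return self.cls_head(x[:, 0])                         # classify from first token

# Usage with random data: 2 captions of 12 tokens each, plus pooled image features.
model = SingleStreamMultimodalClassifier()
logits = model(torch.randint(0, 30522, (2, 12)), torch.randn(2, 2048))
print(logits.shape)  # torch.Size([2, 9])
```

A dual-stream model, by contrast, would encode the caption and the image in two separate transformers and fuse them only afterwards (e.g., via cross-attention or late concatenation), which is the design the abstract reports as weaker on this task.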