We present a novel large-scale dataset and accompanying machine learning models aimed at providing a detailed understanding of the interplay between visual content, its emotional effect, and explanations for the latter in language. In contrast to most existing annotation datasets in computer vision, we focus on the affective experience triggered by visual artworks and ask the annotators to indicate the dominant emotion they feel for a given image and, crucially, to also provide a grounded verbal explanation for their emotion choice. As we demonstrate below, this leads to a rich set of signals for both the objective content and the affective impact of an image, creating associations with abstract concepts (e.g., "freedom" or "love"), or references that go beyond what is directly visible, including visual similes and metaphors, or subjective references to personal experiences. We focus on visual art (e.g., paintings, artistic photographs) as it is a prime example of imagery created to elicit emotional responses from its viewers. Our dataset, termed ArtEmis, contains 439K emotion attributions and explanations from humans, on 81K artworks from WikiArt. Building on this data, we train and demonstrate a series of captioning systems capable of expressing and explaining emotions from visual stimuli. Remarkably, the captions produced by these systems often succeed in reflecting the semantic and abstract content of the image, going well beyond systems trained on existing datasets. The collected dataset and developed methods are available at https://artemisdataset.org.