Authors of social media posts communicate their emotions, and what causes them, with both text and images. While there is work on emotion and stimulus detection for each modality separately, it remains unknown whether the modalities contain complementary emotion information in social media. We fill this research gap and contribute a novel, annotated corpus of English multimodal Reddit posts. On this resource, we develop models to automatically detect the relation between image and text, the emotion stimulus category, and the emotion class. We evaluate whether these tasks require both modalities and find, for the image-text relations, that text alone is sufficient for most categories (complementary, illustrative, opposing): the information in the text allows us to predict whether an image is required for emotion understanding. The emotions of anger and sadness are best predicted with a multimodal model, while text alone is sufficient for disgust, joy, and surprise. Stimuli depicted by objects, animals, food, or a person are best predicted by image-only models, while multimodal models are most effective on art, events, memes, places, or screenshots.