We propose a novel approach to multimodal sentiment analysis using deep neural networks that combine visual analysis and natural language processing. Our goal differs from the standard sentiment analysis goal of predicting whether a sentence expresses positive or negative sentiment; instead, we aim to infer the latent emotional state of the user. Thus, we focus on predicting the emotion word tags attached by users to their Tumblr posts, treating these as "self-reported emotions." We demonstrate that our multimodal model, combining both text and image features, outperforms separate models based solely on either images or text. Our model's results are interpretable, automatically yielding sensible word lists associated with emotions. We explore the structure of emotions implied by our model, compare it to what has been posited in the psychology literature, and validate our model on a set of images that have been used in psychology studies. Finally, our work also provides a useful tool for the growing academic study of images, both photographs and memes, on social networks.
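To make the idea of fusing text and image features for emotion-tag prediction concrete, the following is a minimal illustrative sketch in PyTorch. The layer sizes, the late-fusion-by-concatenation strategy, the GRU text encoder, and the assumption that image features come precomputed from a pretrained CNN are our own simplifications for illustration, not the architecture described in the paper.

```python
# Illustrative sketch only: a minimal late-fusion classifier combining
# precomputed image features with text features to predict emotion tags.
# All dimensions and design choices below are assumptions for illustration.
import torch
import torch.nn as nn


class MultimodalEmotionClassifier(nn.Module):
    def __init__(self, vocab_size=20000, embed_dim=128, text_hidden=256,
                 image_feat_dim=2048, fusion_dim=512, num_emotions=15):
        super().__init__()
        # Text branch: word embeddings summarized by a GRU.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, text_hidden, batch_first=True)
        # Image branch: assumes features from a pretrained CNN (hypothetical).
        self.image_proj = nn.Linear(image_feat_dim, text_hidden)
        # Fusion: concatenate both modalities, then classify emotion tags.
        self.classifier = nn.Sequential(
            nn.Linear(2 * text_hidden, fusion_dim),
            nn.ReLU(),
            nn.Linear(fusion_dim, num_emotions),
        )

    def forward(self, token_ids, image_features):
        # Final GRU hidden state serves as the text representation.
        _, h = self.gru(self.embedding(token_ids))   # h: (1, batch, text_hidden)
        text_repr = h.squeeze(0)
        image_repr = torch.relu(self.image_proj(image_features))
        fused = torch.cat([text_repr, image_repr], dim=1)
        return self.classifier(fused)                # logits over emotion tags


# Example forward pass on dummy data.
model = MultimodalEmotionClassifier()
tokens = torch.randint(1, 20000, (4, 30))   # batch of 4 posts, 30 tokens each
img_feats = torch.randn(4, 2048)            # e.g. penultimate-layer CNN features
logits = model(tokens, img_feats)
print(logits.shape)                          # torch.Size([4, 15])
```

Separate image-only or text-only baselines of the kind compared in the abstract would correspond to feeding only one branch into the classifier.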