In psychoanalysis, there is significant demand for interpreting a person's psychological state from their visual creations. The two main tasks studied in computer vision for this purpose, sentiment/emotion classification and affective captioning, can hardly satisfy the requirements of psychological interpretation. To meet this demand, we introduce a challenging new task, the \textbf{V}isual \textbf{E}motion \textbf{I}nterpretation \textbf{T}ask (VEIT). VEIT requires an AI system to generate reasonable interpretations of a creator's psychological state from their visual creations. To support the task, we present a multimodal dataset termed SpyIn (\textbf{S}and\textbf{p}la\textbf{y} \textbf{In}terpretation Dataset), which is grounded in psychological theory and annotated by professionals. Dataset analysis shows that SpyIn not only supports VEIT but is also more challenging than other captioning datasets. Building on SpyIn, we run experiments with several image captioning methods and propose a visual-semantic combined model that achieves a state-of-the-art (SOTA) result on SpyIn. The results indicate that VEIT is a more challenging task that requires scene graph information and psychological knowledge. Our work also shows promise for AI to analyze and explain the inner world of humans through their visual creations.