Image narrative generation is the task of creating stories about the content of image data from a subjective viewpoint. Because the subjective feelings of writers, characters, and readers are central to storytelling, image narrative generation methods must consider human emotion, which is the major difference between this task and descriptive caption generation. Automated methods for generating story-like text associated with images are of considerable social significance, because stories serve essential functions both as entertainment and for many practical purposes such as education and advertising. In this study, we propose a model called ViNTER (Visual Narrative Transformer with Emotion arc Representation) to generate image narratives that focus on "emotion arcs," time series representing varying emotions, taking advantage of recent advances in multimodal Transformer-based pre-trained models. We present experimental results of both manual and automatic evaluations, which demonstrate the effectiveness of the proposed emotion-aware approach to image narrative generation.