Characters are essential to the plot of any story. Establishing the characters before writing a story can improve the clarity of the plot and the overall flow of the narrative. However, previous work on visual storytelling tends to focus on detecting objects in images and discovering the relationships between them. In this approach, characters are not distinguished from other objects when they are fed into the generation pipeline, so the result is a coherent sequence of events rather than a character-centric story. To address this limitation, we introduce the VIST-Character dataset, which provides rich character-centric annotations, including visual and textual co-reference chains and importance ratings for characters. Based on this dataset, we propose two new tasks: important character detection and character grounding in visual stories. For both tasks, we develop simple, unsupervised models based on distributional similarity and pre-trained vision-and-language models. Our new dataset, together with these models, can serve as the foundation for subsequent work on analysing and generating stories from a character-centric perspective.
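To make the idea of unsupervised, similarity-based character grounding more concrete, the sketch below links each textual character (for example, one co-reference chain) to the visual region whose embedding is most similar to it. This is a hypothetical illustration, not the authors' model. It assumes that mentions and image regions have already been embedded into a shared vector space, for instance by a pre-trained vision-and-language model. The function names `cosine` and `ground_characters`, as well as the toy vectors, are invented for illustration.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def ground_characters(
    mention_vecs: Dict[str, Sequence[float]],
    region_vecs: Dict[str, Sequence[float]],
) -> Dict[str, str]:
    """Hypothetical unsupervised baseline: map each textual character to the
    visual region with the highest cosine similarity (no training involved).

    Assumes region_vecs is non-empty; max() raises ValueError otherwise.
    """
    grounding = {}
    for name, m_vec in mention_vecs.items():
        # Pick the region id whose embedding is closest to this character's embedding.
        best_region = max(region_vecs, key=lambda rid: cosine(m_vec, region_vecs[rid]))
        grounding[name] = best_region
    return grounding


if __name__ == "__main__":
    # Toy embeddings (invented); real ones would come from a vision-and-language encoder.
    mentions = {"grandma": [0.9, 0.1, 0.0], "the dog": [0.0, 0.2, 0.95]}
    regions = {"box_0": [0.1, 0.1, 0.9], "box_1": [0.8, 0.2, 0.1]}
    print(ground_characters(mentions, regions))
```

With these toy vectors, "grandma" is linked to `box_1` and "the dog" to `box_0`. A greedy argmax like this can assign several characters to the same region. A one-to-one matching (for example, the Hungarian algorithm) would be a natural alternative when each region is expected to depict at most one character.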