Story visualization is an under-explored task that falls at the intersection of many important research directions in both computer vision and natural language processing. In this task, given a series of natural language captions which compose a story, an agent must generate a sequence of images that correspond to the captions. Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task. However, the generated images still leave room for improvement in visual quality, coherence, and relevance. We present a number of improvements to prior modeling approaches, including (1) the addition of a dual learning framework that utilizes video captioning to reinforce the semantic alignment between the story and generated images, (2) a copy-transform mechanism for sequentially consistent story visualization, and (3) MART-based transformers to model complex interactions between frames. We present ablation studies to demonstrate the effect of each of these techniques on the generative power of the model, both for individual images and for the entire narrative. Furthermore, due to the complexity and generative nature of the task, standard evaluation metrics do not accurately reflect performance. Therefore, we also provide an exploration of evaluation metrics for the model, focusing on aspects of the generated frames such as the presence and quality of generated characters, the relevance to captions, and the diversity of the generated images. We also present correlation experiments comparing our proposed automated metrics with human evaluations. Code and data are available at: https://github.com/adymaharana/StoryViz