Recent advances in text-to-image synthesis have led to large pretrained transformers with excellent capabilities to generate visualizations from a given text. However, these models are ill-suited for specialized tasks like story visualization, which requires an agent to produce a sequence of images given a corresponding sequence of captions, forming a narrative. Moreover, we find that the story visualization task, as originally proposed, does not accommodate generalization to unseen plots and characters in new narratives. Hence, we first propose the task of story continuation, where the generated visual story is conditioned on a source image, allowing for better generalization to narratives with new characters. We then enhance or 'retro-fit' the pretrained text-to-image synthesis models with task-specific modules for (a) sequential image generation and (b) copying relevant elements from an initial frame. Next, we explore full-model finetuning, as well as prompt-based tuning for parameter-efficient adaptation, of the pretrained model. We evaluate our approach, StoryDALL-E, on two existing datasets, PororoSV and FlintstonesSV, and introduce a new dataset, DiDeMoSV, collected from a video-captioning dataset. We also develop a GAN-based model, StoryGANc, for story continuation, and compare it with StoryDALL-E to demonstrate the advantages of our approach. We show that our retro-fitting approach outperforms GAN-based models for story continuation and facilitates the copying of visual elements from the source image, thereby improving continuity in the generated visual story. Finally, our analysis suggests that pretrained transformers struggle to comprehend narratives containing several characters. Overall, our work demonstrates that pretrained text-to-image synthesis models can be adapted for complex and low-resource tasks like story continuation.
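To make the two retro-fitting ideas concrete, the sketch below illustrates (a) prompt-based, parameter-efficient tuning via learnable prompt tokens prepended to the caption embeddings of a frozen pretrained backbone, and (b) a new cross-attention module that lets generated frames copy visual elements from the source frame. This is a minimal illustration under assumed shapes and a generic stand-in backbone; all names, dimensions, and modules here are hypothetical and do not reproduce the actual StoryDALL-E implementation.

```python
import torch
import torch.nn as nn

class RetroFitStoryModel(nn.Module):
    """Sketch of retro-fitting a frozen pretrained transformer:
    (a) learnable prompt embeddings for parameter-efficient tuning, and
    (b) cross-attention over source-frame features so generated frames
        can copy elements from the initial frame.
    Hypothetical stand-in, not the actual StoryDALL-E architecture."""

    def __init__(self, pretrained_backbone: nn.Module, d_model: int = 512,
                 n_prompt_tokens: int = 8, n_heads: int = 8):
        super().__init__()
        self.backbone = pretrained_backbone
        for p in self.backbone.parameters():   # freeze: only new modules train
            p.requires_grad = False
        # (a) task-specific prompt tokens, learned during adaptation
        self.prompt = nn.Parameter(torch.randn(n_prompt_tokens, d_model) * 0.02)
        # (b) new cross-attention over features of the source frame
        self.copy_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, caption_emb, source_frame_feats):
        # caption_emb: (B, T, d_model); source_frame_feats: (B, S, d_model)
        b = caption_emb.size(0)
        prompt = self.prompt.unsqueeze(0).expand(b, -1, -1)
        x = torch.cat([prompt, caption_emb], dim=1)   # prepend prompt tokens
        h = self.backbone(x)                          # frozen pretrained forward
        # attend over the initial frame to copy relevant visual elements
        copied, _ = self.copy_attn(h, source_frame_feats, source_frame_feats)
        return h + copied

# Usage with a small stand-in for the pretrained text-to-image transformer:
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=2)
model = RetroFitStoryModel(backbone)
out = model(torch.randn(2, 16, 512), torch.randn(2, 64, 512))
print(out.shape)  # torch.Size([2, 24, 512]): 8 prompt tokens + 16 caption tokens
```

Because the backbone is frozen, only the prompt embeddings and the copy cross-attention receive gradients, which is what makes this style of adaptation parameter-efficient and suitable for low-resource tasks like story continuation.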