StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation

Recent advances in text-to-image synthesis have led to large pretrained transformers with excellent capabilities to generate visualizations from a given text. However, these models are ill-suited for specialized tasks like story visualization, which requires an agent to produce a sequence of images given a corresponding sequence of captions, forming a narrative. Moreover, we find that the story visualization task fails to accommodate generalization to unseen plots and characters in new narratives. Hence, we first propose the task of story continuation, where the generated visual story is conditioned on a source image, allowing for better generalization to narratives with new characters. Then, we enhance or 'retro-fit' the pretrained text-to-image synthesis models with task-specific modules for (a) sequential image generation and (b) copying relevant elements from an initial frame. We also explore full-model finetuning, as well as prompt-based tuning for parameter-efficient adaptation, of the pretrained model. We evaluate our approach StoryDALL-E on two existing datasets, PororoSV and FlintstonesSV, and introduce a new dataset DiDeMoSV collected from a video-captioning dataset. We also develop a model StoryGANc based on Generative Adversarial Networks (GAN) for story continuation, and compare it with the StoryDALL-E model to demonstrate the advantages of our approach. We show that our retro-fitting approach outperforms GAN-based models for story continuation and facilitates copying of visual elements from the source image, thereby improving continuity in the generated visual story. Finally, our analysis suggests that pretrained transformers struggle to comprehend narratives containing several characters. Overall, our work demonstrates that pretrained text-to-image synthesis models can be adapted for complex and low-resource tasks like story continuation.
