There has been a recent explosion of impressive generative models that can produce high-quality images (or videos) conditioned on text descriptions. However, all such approaches rely on conditioning sentences that describe the scene and its main actors unambiguously. Employing these models for the more complex task of story visualization, where references and co-references occur naturally and one must reason, based on story progression, about when to maintain consistency of actors and backgrounds across frames/scenes and when not to, therefore remains a challenge. In this work, we address these challenges and propose a novel autoregressive diffusion-based framework with a visual memory module that implicitly captures actor and background context across the generated frames. Sentence-conditioned soft attention over the memories enables effective reference resolution and learns to maintain scene and actor consistency when needed. To validate the effectiveness of our approach, we extend the MUGEN dataset, introducing additional characters, backgrounds, and referencing in multi-sentence storylines. Our experiments on story generation on the MUGEN, PororoSV, and FlintstonesSV datasets show that our method not only outperforms prior state-of-the-art in generating frames with high visual quality that are consistent with the story, but also models appropriate correspondences between the characters and the background.
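To make the memory mechanism concrete, below is a minimal, hypothetical PyTorch sketch of sentence-conditioned soft attention over a visual memory of past frame features. All module names, dimensions, and the scaled dot-product formulation are illustrative assumptions, not the paper's exact architecture; in the full framework the resulting context would condition the diffusion denoiser for the current frame.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceConditionedMemoryAttention(nn.Module):
    """Illustrative sketch (assumed design, not the paper's exact layers):
    the current sentence embedding queries a memory of past frame features,
    and soft attention pools a context vector for reference resolution."""

    def __init__(self, text_dim: int, mem_dim: int, attn_dim: int):
        super().__init__()
        # Project the sentence embedding into a query, and each memory
        # slot (features of a previously generated frame) into keys/values.
        self.to_query = nn.Linear(text_dim, attn_dim)
        self.to_key = nn.Linear(mem_dim, attn_dim)
        self.to_value = nn.Linear(mem_dim, attn_dim)

    def forward(self, sentence_emb: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # sentence_emb: (batch, text_dim) -- embedding of the current story sentence
        # memory:       (batch, num_past_frames, mem_dim) -- stored frame features
        q = self.to_query(sentence_emb).unsqueeze(1)            # (batch, 1, attn_dim)
        k = self.to_key(memory)                                 # (batch, T, attn_dim)
        v = self.to_value(memory)                               # (batch, T, attn_dim)
        scores = (q @ k.transpose(1, 2)) / k.shape[-1] ** 0.5   # (batch, 1, T)
        # Soft attention weights: how relevant each past frame is to
        # resolving the current sentence's (co-)references.
        weights = F.softmax(scores, dim=-1)
        context = (weights @ v).squeeze(1)                      # (batch, attn_dim)
        return context

# Usage with hypothetical dimensions: the context vector would be injected
# into the frame generator (e.g., via cross-attention) so that references
# like "he" or "the same room" resolve to the right actor or background.
attn = SentenceConditionedMemoryAttention(text_dim=512, mem_dim=768, attn_dim=256)
ctx = attn(torch.randn(2, 512), torch.randn(2, 4, 768))
print(ctx.shape)  # torch.Size([2, 256])
```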