Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators

Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets. In this paper, we introduce a new task of zero-shot text-to-video generation and propose a low-cost approach (without any training or optimization) by leveraging the power of existing text-to-image synthesis methods (e.g., Stable Diffusion), making them suitable for the video domain. Our key modifications include (i) enriching the latent codes of the generated frames with motion dynamics to keep the global scene and the background time consistent; and (ii) reprogramming frame-level self-attention using a new cross-frame attention of each frame on the first frame, to preserve the context, appearance, and identity of the foreground object. Experiments show that this leads to low overhead, yet high-quality and remarkably consistent video generation. Moreover, our approach is not limited to text-to-video synthesis but is also applicable to other tasks such as conditional and content-specialized video generation, and Video Instruct-Pix2Pix, i.e., instruction-guided video editing. As experiments show, our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data. Our code will be open sourced at: https://github.com/Picsart-AI-Research/Text2Video-Zero .
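Modification (ii) above, in which each frame attends to the first frame instead of to itself, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `softmax` and `cross_frame_attention`, the single attention head, the tensor shapes, and the absence of learned projection layers are all simplifying assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_frame_attention(q, k, v):
    """Hypothetical single-head cross-frame attention.

    q, k, v: arrays of shape (frames, tokens, dim).
    Ordinary self-attention would pair q[i] with k[i] and v[i].
    Here every frame's queries are paired with the keys and values
    of frame 0, so all frames read appearance from the first frame.
    """
    dim = q.shape[-1]
    k0, v0 = k[0], v[0]                    # (tokens, dim) each
    scores = q @ k0.T / np.sqrt(dim)       # (frames, tokens, tokens)
    weights = softmax(scores, axis=-1)     # rows sum to 1
    return weights @ v0                    # (frames, tokens, dim)


# Toy usage with random features: 4 frames, 6 tokens, 8 channels (assumed sizes).
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 6, 8))
k = rng.normal(size=(4, 6, 8))
v = rng.normal(size=(4, 6, 8))
out = cross_frame_attention(q, k, v)
print(out.shape)
```

Because the keys and values of frames after the first are never read, each output frame is a weighted mixture of first-frame values only, which is one way to see why appearance and identity would stay anchored to the first frame.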

