We propose Make-A-Video -- an approach for directly translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V). Our intuition is simple: learn what the world looks like and how it is described from paired text-image data, and learn how the world moves from unsupervised video footage. Make-A-Video has three advantages: (1) it accelerates training of the T2V model (it does not need to learn visual and multimodal representations from scratch), (2) it does not require paired text-video data, and (3) the generated videos inherit the vastness (diversity in aesthetics, fantastical depictions, etc.) of today's image generation models. We design a simple yet effective way to build on T2I models with novel spatial-temporal modules. First, we decompose the full temporal U-Net and attention tensors and approximate them in space and time. Second, we design a spatial-temporal pipeline to generate high-resolution, high-frame-rate videos with a video decoder, an interpolation model, and two super-resolution models that enable various applications beyond T2V. In all aspects (spatial and temporal resolution, faithfulness to text, and quality), Make-A-Video sets the new state-of-the-art in text-to-video generation, as determined by both qualitative and quantitative measures.
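To make the factorization idea concrete, the sketch below shows a minimal "pseudo-3D" block that approximates a full space-time convolution by a 2D spatial step followed by a 1D temporal step. This is an illustrative assumption of how such a decomposition can be written in PyTorch, not the paper's actual implementation; the class name `Pseudo3DConv`, the tensor layout, and the identity initialization of the temporal part are choices made here for clarity.

```python
# A minimal sketch (not the authors' code) of factorizing a space-time operation
# into a spatial pass plus a temporal pass, so pretrained T2I (spatial) weights
# could be reused and only the temporal part needs video data.
import torch
import torch.nn as nn


class Pseudo3DConv(nn.Module):
    """Approximate a 3D convolution over (frames, H, W) with a 2D spatial
    convolution followed by a 1D temporal convolution."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.spatial = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        # Identity-initialized temporal conv: at initialization the block
        # behaves like a per-frame (image-only) operation.
        self.temporal = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        nn.init.dirac_(self.temporal.weight)
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        b, c, t, h, w = x.shape
        # Spatial pass: fold frames into the batch dimension.
        x = self.spatial(x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w))
        # Temporal pass: fold pixels into the batch dimension.
        x = x.reshape(b, t, c, h, w).permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        x = self.temporal(x)
        return x.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)


if __name__ == "__main__":
    video = torch.randn(2, 64, 8, 32, 32)  # (batch, channels, frames, H, W)
    out = Pseudo3DConv(64)(video)
    print(out.shape)  # torch.Size([2, 64, 8, 32, 32])
```

The same factorization pattern (a spatial layer followed by a temporal layer) can also be applied to attention, which is the sense in which the abstract speaks of decomposing the temporal U-Net and attention tensors and approximating them in space and time.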