Make-A-Video: Text-to-Video Generation without Text-Video Data

We propose Make-A-Video, an approach for directly translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V). Our intuition is simple: learn what the world looks like and how it is described from paired text-image data, and learn how the world moves from unsupervised video footage. Make-A-Video has three advantages: (1) it accelerates training of the T2V model (it does not need to learn visual and multimodal representations from scratch), (2) it does not require paired text-video data, and (3) the generated videos inherit the vastness (diversity in aesthetic, fantastical depictions, etc.) of today's image generation models. We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules. First, we decompose the full temporal U-Net and attention tensors and approximate them in space and time. Second, we design a spatial-temporal pipeline to generate high-resolution, high-frame-rate videos with a video decoder, an interpolation model, and two super-resolution models that can enable various applications besides T2V. In all aspects (spatial and temporal resolution, faithfulness to text, and quality), Make-A-Video sets a new state of the art in text-to-video generation, as determined by both qualitative and quantitative measures.
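The decomposition of attention "in space and time" mentioned in the abstract can be illustrated with a small NumPy sketch. This is not the paper's implementation: it assumes a single attention head with no learned projections, and the function names (`self_attention`, `factorized_space_time_attention`, `attention_score_counts`) are hypothetical. It only shows the general idea of replacing one attention pass over all T·H·W tokens with a spatial pass inside each frame followed by a temporal pass at each pixel location.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Projection-free single-head self-attention: (..., N, C) -> (..., N, C).

    Assumption for illustration: queries, keys, and values are the tokens
    themselves; a real model would use learned linear projections.
    """
    scale = 1.0 / np.sqrt(tokens.shape[-1])
    scores = tokens @ np.swapaxes(tokens, -1, -2) * scale  # (..., N, N)
    return softmax(scores, axis=-1) @ tokens

def factorized_space_time_attention(video):
    """Approximate full space-time attention on a (T, H, W, C) tensor."""
    T, H, W, C = video.shape
    # Spatial pass: each frame attends over its own H*W positions.
    x = self_attention(video.reshape(T, H * W, C))
    # Temporal pass: each spatial position attends over the T frames.
    x = np.swapaxes(x, 0, 1)           # (H*W, T, C)
    x = self_attention(x)
    x = np.swapaxes(x, 0, 1)           # back to (T, H*W, C)
    return x.reshape(T, H, W, C)

def attention_score_counts(T, H, W):
    """Number of pairwise attention scores: (full, factorized)."""
    full = (T * H * W) ** 2
    factorized = T * (H * W) ** 2 + (H * W) * T ** 2
    return full, factorized

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(4, 3, 3, 8))   # 4 frames of 3x3 with 8 channels
    print(factorized_space_time_attention(video).shape)
    print(attention_score_counts(16, 8, 8))
```

Under these assumptions, a 16-frame 8×8 feature map needs 1,048,576 pairwise scores with full attention and 81,920 with the factorized form, which is the kind of saving that motivates approximating the full tensors in space and time. With a single frame the temporal pass reduces to the identity, so the sketch behaves like plain spatial (image) attention.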

