We present a framework for video modeling based on denoising diffusion probabilistic models that produces long-duration video completions in a variety of realistic environments. We introduce a generative model that can, at test time, sample any arbitrary subset of video frames conditioned on any other subset, and present an architecture adapted for this purpose. Doing so allows us to efficiently compare and optimize a variety of schedules for the order in which the frames of a long video are sampled, and to use selective, sparse, and long-range conditioning on previously sampled frames. We demonstrate improved video modeling over prior work on a number of datasets and sample temporally coherent videos over 25 minutes in length. We additionally release a new video modeling dataset and semantically meaningful metrics based on videos generated in the CARLA autonomous driving simulator.
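To make the core interface concrete, the sketch below illustrates one way a reverse-diffusion step could denoise only an arbitrary subset of frames while clamping the remaining frames to observed conditioning values. This is a minimal, schematic sketch, not the paper's actual architecture or training procedure: the function `ddpm_step_with_conditioning`, the mask convention, and the `eps_model` signature are all hypothetical names introduced here for illustration.

```python
import torch

def ddpm_step_with_conditioning(x_t, x_obs, sample_mask, eps_model, t, alphas, alphas_bar):
    """One reverse DDPM step that denoises only the frames selected by
    sample_mask, while frames where sample_mask == 0 stay clamped to the
    observed conditioning frames x_obs. (Hypothetical sketch, not the
    paper's implementation.)

    x_t:         (B, F, C, H, W) noisy video at diffusion step t
    x_obs:       (B, F, C, H, W) observed frames (ignored where sample_mask == 1)
    sample_mask: (B, F, 1, 1, 1) 1 for frames being generated, 0 for conditioning
    eps_model:   noise-prediction network taking (x, mask, t); assumed interface
    alphas, alphas_bar: 1-D tensors holding the DDPM noise schedule
    """
    a_t, ab_t = alphas[t], alphas_bar[t]
    eps = eps_model(x_t, sample_mask, t)  # predicted noise for the current step
    # Standard DDPM posterior mean from the predicted noise.
    mean = (x_t - (1 - a_t) / torch.sqrt(1 - ab_t) * eps) / torch.sqrt(a_t)
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    x_prev = mean + torch.sqrt(1 - a_t) * noise
    # Clamp conditioning frames back to their observed values so only the
    # selected subset is actually sampled.
    return sample_mask * x_prev + (1 - sample_mask) * x_obs
```

Under this interface, a sampling schedule for a long video is just a sequence of (frames-to-sample, frames-to-condition-on) mask pairs, which is what allows different schedules to be compared and optimized without retraining.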