We present a framework for video modeling based on denoising diffusion probabilistic models that produces long-duration video completions in a variety of realistic environments. We introduce a generative model that can, at test time, sample an arbitrary subset of video frames conditioned on any other subset, and we present an architecture adapted for this purpose. Doing so allows us to efficiently compare and optimize a variety of schedules for the order in which frames in a long video are sampled, and to use selective sparse and long-range conditioning on previously sampled frames. We demonstrate improved video modeling over prior work on a number of datasets and sample temporally coherent videos over 25 minutes in length. We additionally release a new video modeling dataset and semantically meaningful metrics based on videos generated in the CARLA self-driving car simulator.
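To make the abstract's central idea concrete, the following is a minimal, hypothetical sketch (not the authors' released code) of DDPM-style ancestral sampling of one arbitrary subset of frames conditioned on another. The names `denoiser` and `sample_frames` are illustrative assumptions: `denoiser` stands in for a trained epsilon-prediction network that, as the abstract describes, takes both the noisy frames being sampled and the clean conditioning frames, together with their indices in the video.

```python
# Hypothetical sketch: sampling frames at arbitrary indices conditioned on
# an arbitrary observed subset, using a standard DDPM reverse process.
import torch

T = 1000  # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def denoiser(x_lat, x_obs, idx_lat, idx_obs, t):
    # Stand-in for the trained network; a real model would attend across
    # the latent and observed frames, using their frame indices and t.
    return torch.zeros_like(x_lat)

@torch.no_grad()
def sample_frames(x_obs, idx_obs, idx_lat, frame_shape):
    """Ancestrally sample frames at indices `idx_lat`, conditioned on
    clean frames `x_obs` located at indices `idx_obs`."""
    x = torch.randn(len(idx_lat), *frame_shape)  # start from pure noise
    for t in reversed(range(T)):
        eps = denoiser(x, x_obs, idx_lat, idx_obs, t)
        a, ab = alphas[t], alpha_bars[t]
        # Posterior mean of the DDPM reverse step.
        mean = (x - (1 - a) / (1 - ab).sqrt() * eps) / a.sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise
    return x

# Example: condition on frames 0 and 100 of a long video, and sample
# intermediate frames 25, 50, and 75 in one pass.
obs = torch.zeros(2, 3, 64, 64)  # placeholder observed frames
new = sample_frames(obs, idx_obs=[0, 100], idx_lat=[25, 50, 75],
                    frame_shape=(3, 64, 64))
```

Note that in this sketch the observed frames are passed directly to the network rather than imposed by inpainting; that choice is an assumption on our part, motivated by the abstract's statement that the architecture is adapted to condition on any subset. A long video would then be generated by repeating such calls under a chosen sampling schedule.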