Video prediction is a challenging task. Frames generated by current state-of-the-art (SOTA) generative models tend to be of poor quality, and generalization beyond the training data is difficult. Furthermore, existing prediction frameworks typically cannot simultaneously handle other video-related tasks such as unconditional generation or interpolation. In this work, we devise a general-purpose framework called Masked Conditional Video Diffusion (MCVD) for all of these video synthesis tasks using a probabilistic conditional score-based denoising diffusion model, conditioned on past and/or future frames. During training, we randomly and independently mask all the past frames or all the future frames. This novel but straightforward setup allows us to train a single model that can execute a broad range of video tasks, specifically: future/past prediction -- when only future/past frames are masked; unconditional generation -- when both past and future frames are masked; and interpolation -- when neither past nor future frames are masked. Our experiments show that this approach can generate high-quality frames for diverse types of videos. Our MCVD models are built from simple non-recurrent 2D-convolutional architectures, conditioning on blocks of frames and generating blocks of frames. We generate videos of arbitrary length autoregressively in a block-wise manner. Our approach yields SOTA results across standard video prediction and interpolation benchmarks, with training times of 1-12 days using $\le$ 4 GPUs. https://mask-cond-video-diffusion.github.io
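The masking scheme and the block-wise autoregressive generation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`mask_conditioning`, `task_name`, `rollout`), the masking probability of 0.5, the choice to represent a masked block as zeros, and the `sample_block` callable standing in for a trained diffusion sampler are all assumptions made for illustration.

```python
import numpy as np


def mask_conditioning(past, future, rng, p_mask=0.5):
    """Independently mask the whole past block and/or the whole future block.

    Assumptions (not from the abstract): masked frames are replaced by zeros,
    and each block is masked with probability p_mask.
    """
    keep_past = rng.random() >= p_mask
    keep_future = rng.random() >= p_mask
    past_in = past if keep_past else np.zeros_like(past)
    future_in = future if keep_future else np.zeros_like(future)
    return past_in, future_in, keep_past, keep_future


def task_name(keep_past, keep_future):
    """Map the visible conditioning blocks to the task named in the abstract."""
    if keep_past and keep_future:
        return "interpolation"
    if keep_past:
        return "future prediction"
    if keep_future:
        return "past prediction"
    return "unconditional generation"


def rollout(sample_block, context, n_frames):
    """Generate n_frames autoregressively, one block at a time.

    sample_block is a hypothetical callable: it takes the current context
    frames, shape (k, H, W, C), and returns a new block of frames.
    """
    k = context.shape[0]
    blocks, total = [], 0
    while total < n_frames:
        block = sample_block(context)
        blocks.append(block)
        total += block.shape[0]
        # Slide the context window so it holds the k most recent frames.
        context = np.concatenate([context, block], axis=0)[-k:]
    return np.concatenate(blocks, axis=0)[:n_frames]
```

In this sketch a single training loop would call `mask_conditioning` on every batch, so the same network sees all four conditioning patterns; at sampling time, `rollout` feeds each generated block back in as context to extend the video to any length.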