MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation

Video prediction is a challenging task. Frames produced by current state-of-the-art (SOTA) generative models tend to be of poor quality, and generalization beyond the training data is difficult. Furthermore, existing prediction frameworks are typically unable to handle other video-related tasks, such as unconditional generation or interpolation, at the same time. In this work, we devise a general-purpose framework called Masked Conditional Video Diffusion (MCVD) for all of these video synthesis tasks, using a probabilistic conditional score-based denoising diffusion model conditioned on past and/or future frames. We train the model by randomly and independently masking all the past frames or all the future frames. This novel but straightforward setup allows us to train a single model that can execute a broad range of video tasks, specifically: future/past prediction, when only future/past frames are masked; unconditional generation, when both past and future frames are masked; and interpolation, when neither past nor future frames are masked. Our experiments show that this approach can generate high-quality frames for diverse types of videos. Our MCVD models are built from simple non-recurrent 2D-convolutional architectures that condition on blocks of frames and generate blocks of frames. We generate videos of arbitrary length autoregressively, one block at a time. Our approach yields SOTA results across standard video prediction and interpolation benchmarks, with training times of 1-12 days using $\le$ 4 GPUs. Project page: https://mask-cond-video-diffusion.github.io ; Code: https://github.com/voletiv/mcvd-pytorch
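
The masking scheme and the block-wise rollout described in the abstract can be illustrated with a minimal sketch. This is not the code from the MCVD repository. The function names, the masking probability of 0.5, the choice of zeroing a masked block, and the `sample_block` callable (a stand-in for a trained diffusion sampler) are all assumptions made for illustration. Frames are represented as plain lists of floats.

```python
import random


def sample_masks(rng, p_mask_past=0.5, p_mask_future=0.5):
    """Independently decide whether to mask the past and the future block.

    The probabilities are illustrative assumptions, not values from the paper.
    """
    mask_past = rng.random() < p_mask_past
    mask_future = rng.random() < p_mask_future
    return mask_past, mask_future


def task_for(mask_past, mask_future):
    """Name the task that a given mask combination corresponds to."""
    if mask_past and mask_future:
        return "unconditional generation"
    if mask_future:
        return "future prediction"  # only past frames are visible
    if mask_past:
        return "past prediction"  # only future frames are visible
    return "interpolation"  # both past and future frames are visible


def apply_masks(past, future, mask_past, mask_future):
    """Zero out a whole conditioning block when it is masked (assumed mask value)."""
    def zeros_like(block):
        return [[0.0] * len(frame) for frame in block]

    past_out = zeros_like(past) if mask_past else past
    future_out = zeros_like(future) if mask_future else future
    return past_out, future_out


def rollout(sample_block, context, num_frames, block_size):
    """Generate num_frames frames autoregressively, one block at a time.

    sample_block(context, block_size) is a hypothetical sampler that returns
    block_size new frames given the current context frames.
    """
    n_ctx = len(context)
    frames = []
    while len(frames) < num_frames:
        new_frames = sample_block(context, block_size)
        frames.extend(new_frames)
        # Slide the context window forward over the newest frames.
        context = (context + new_frames)[-n_ctx:] if n_ctx else []
    return frames[:num_frames]
```

During training, one would call `sample_masks` once per example and pass the result to `apply_masks` before conditioning the denoiser. At inference time, choosing the masks by hand selects the task, and `rollout` extends a video past a single block by feeding generated frames back in as context.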
