Predicting future outcomes and reasoning about missing information in a sequence are critical skills for agents to make intelligent decisions. This requires strong, temporally coherent generative capabilities. Diffusion models have shown remarkable success in several generative tasks but have not been extensively explored in the video domain. We present Random-Mask Video Diffusion (RaMViD), which extends image diffusion models to videos using 3D convolutions and introduces a new conditioning technique during training. By varying the mask we condition on, the model can perform video prediction, infilling, and upsampling. Because of this simple conditioning scheme, we can use the same architecture as for unconditional training, which allows us to train the model conditionally and unconditionally at the same time. We evaluate RaMViD on two benchmark datasets for video prediction, on which we achieve state-of-the-art results, and on one for video generation. High-resolution videos are provided at https://sites.google.com/view/video-diffusion-prediction.
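The mask-based conditioning can be made concrete with a short sketch. The following is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`make_random_mask`, `conditioned_input`), the two-frame prediction context, and the shape convention (B, C, T, H, W) are choices made for exposition. During training, a random subset of frames per sample is kept clean (conditioned on) while the rest are diffused; at inference, the choice of mask selects the task. When the random subset happens to be empty, the step is effectively unconditional, which is how one architecture can train conditionally and unconditionally at once.

```python
import torch

def make_random_mask(batch_size: int, num_frames: int, task: str = "train") -> torch.Tensor:
    """Boolean mask of shape (B, T): True = frame is a clean conditioning
    frame, False = frame is noised and must be generated."""
    mask = torch.zeros(batch_size, num_frames, dtype=torch.bool)
    if task == "prediction":      # condition on the first frames, predict the rest
        mask[:, :2] = True
    elif task == "infilling":     # condition on the endpoints, fill in between
        mask[:, 0] = True
        mask[:, -1] = True
    elif task == "upsampling":    # condition on every other frame (temporal upsampling)
        mask[:, ::2] = True
    else:                         # training: random subset per sample;
        for b in range(batch_size):   # k = 0 gives the unconditional case
            k = int(torch.randint(0, num_frames, (1,)))
            mask[b, torch.randperm(num_frames)[:k]] = True
    return mask

def conditioned_input(x0: torch.Tensor, xt: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Assemble the model input: clean frames where the mask is True,
    diffused frames elsewhere. x0 and xt have shape (B, C, T, H, W)."""
    m = mask[:, None, :, None, None].to(x0.dtype)  # broadcast (B, T) -> (B, 1, T, 1, 1)
    return m * x0 + (1.0 - m) * xt

# Toy usage: assembling the inputs for one training step.
B, C, T, H, W = 2, 3, 8, 64, 64
x0 = torch.randn(B, C, T, H, W)             # clean video batch
xt = torch.randn_like(x0)                   # stand-in for the diffused video at step t
mask = make_random_mask(B, T)               # random per-sample conditioning mask
model_in = conditioned_input(x0, xt, mask)  # same shape as the unconditional input
# The denoising loss would then be computed only on frames where mask is False.
```

Because the conditioning is expressed entirely through the input tensor and the loss mask, no extra conditioning channels or architectural changes are needed, consistent with the abstract's claim that the unconditional architecture is reused.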