Anticipating future outcomes and reasoning about missing information in a sequence are critical skills for agents to make intelligent decisions. This requires strong, temporally coherent generative capabilities. Diffusion models have shown remarkable success in several generative tasks but have not been extensively explored in the video domain. We present Random-Mask Video Diffusion (RaMViD), which extends image diffusion models to videos using 3D convolutions and introduces a new conditioning technique during training. By varying the mask we condition on, the model can perform video prediction, infilling, and upsampling. Because our conditioning scheme is simple, we can use the same architecture as for unconditional training, which allows us to train the model in a conditional and unconditional fashion at the same time. We evaluate the model on two benchmark datasets for video prediction, on which we achieve state-of-the-art results, and on one benchmark dataset for video generation.
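To make the conditioning scheme concrete, below is a minimal sketch of one training step with random-mask conditioning, assuming a DDPM-style denoiser over video tensors of shape (batch, channels, frames, height, width). The function names, the toy noise schedule, and the hyperparameters (`num_timesteps`, `p_uncond`) are illustrative assumptions, not the authors' implementation: a random subset of frames is kept clean as the condition, noise is added to the remaining frames, and the loss is computed only on the noised frames; with some probability the mask is empty, so the same model is also trained unconditionally.

```python
import torch

def training_step(model, video, num_timesteps=1000, p_uncond=0.1):
    """One diffusion training step with random-mask conditioning (sketch).

    `model` is an assumed denoiser taking (x_t, t) and predicting the noise.
    A random subset of frames stays clean (the condition); the loss is
    computed only on the frames that were noised. With probability
    `p_uncond` no frame is conditioned on (unconditional training).
    """
    b, c, f, h, w = video.shape

    # Sample which frames to condition on; an empty mask means
    # fully unconditional training for this batch.
    if torch.rand(()) < p_uncond:
        cond = torch.zeros(f, dtype=torch.bool)
    else:
        k = torch.randint(1, f, ())            # number of clean frames
        cond = torch.zeros(f, dtype=torch.bool)
        cond[torch.randperm(f)[:k]] = True     # random frame positions
    mask = cond.view(1, 1, f, 1, 1).to(video.device)

    # Standard forward diffusion; here a toy cosine schedule stands in
    # for whatever schedule the actual model uses.
    t = torch.randint(0, num_timesteps, (b,), device=video.device)
    noise = torch.randn_like(video)
    alpha_bar = torch.cos(t.float() / num_timesteps * torch.pi / 2) ** 2
    alpha_bar = alpha_bar.view(b, 1, 1, 1, 1)
    noisy = alpha_bar.sqrt() * video + (1 - alpha_bar).sqrt() * noise

    # Conditioning frames stay clean; the rest are noised.
    x_t = torch.where(mask, video, noisy)

    # The denoiser sees clean and noisy frames together, so the same
    # architecture serves conditional and unconditional training; the
    # loss is restricted to the noised frames.
    pred = model(x_t, t)
    loss = ((pred - noise) ** 2 * (~mask)).mean()
    return loss
```

At inference time, the same mechanism covers the tasks named above by choice of mask alone: clean leading frames give prediction, clean frames at both ends give infilling, and clean frames at regular strides give temporal upsampling.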