Anticipating future outcomes and reasoning about missing information in a sequence are key abilities for agents to make intelligent decisions. This requires strong, temporally coherent generative capabilities. Diffusion models have recently shown great success in several generative tasks, but have not been extensively explored in the video domain. We present Random-Mask Video Diffusion (RaMViD), which extends image diffusion models to videos using 3D convolutions and introduces a new conditioning technique during training. By varying the mask we condition on, the model is able to perform video prediction, infilling, and upsampling. Since we do not use concatenation to condition on a mask, as is done in most conditionally trained diffusion models, we are able to decrease the memory footprint. We evaluated the model on two benchmark datasets for video prediction and one for video generation, on which we achieved competitive results. On Kinetics-600 we achieved state-of-the-art results for video prediction.
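The random-mask conditioning idea can be illustrated with a minimal training-step sketch. This is an assumption-laden illustration, not the authors' code: the `model` object and its `(input, timestep)` signature, the cosine noise schedule, and the way conditioning frames are sampled here are all placeholders. The point it shows is that clean conditioning frames are merged with noised frames through a per-frame mask, rather than concatenated as extra input channels.

```python
import torch

def random_mask_training_step(model, video, num_cond_max=4):
    """One illustrative training step with random-mask conditioning.

    `video` has shape (batch, channels, frames, height, width); the
    3D-convolutional `model` predicts the noise added to the unmasked frames.
    `num_cond_max` and the uniform choice of conditioning frames are
    assumptions for illustration, not the paper's exact schedule.
    """
    b, c, t, h, w = video.shape

    # Randomly choose which frames are conditioning frames (kept clean).
    # Depending on the sampled mask, the same model learns prediction
    # (leading frames kept), infilling (interior frames kept), or
    # upsampling (every k-th frame kept).
    num_cond = torch.randint(0, num_cond_max + 1, (1,)).item()
    cond_idx = torch.randperm(t)[:num_cond]
    mask = torch.zeros(b, 1, t, 1, 1)
    mask[:, :, cond_idx] = 1.0  # 1 = conditioning frame, 0 = frame to generate

    # Standard diffusion forward process (placeholder cosine schedule),
    # applied only to the frames that are not conditioned on.
    timestep = torch.randint(0, 1000, (b,))
    noise = torch.randn_like(video)
    alpha_bar = torch.cos(0.5 * torch.pi * timestep / 1000).view(b, 1, 1, 1, 1) ** 2
    noised = alpha_bar.sqrt() * video + (1 - alpha_bar).sqrt() * noise

    # Conditioning frames stay clean; no channel concatenation is needed,
    # which keeps the input tensor (and memory footprint) unchanged.
    model_input = mask * video + (1 - mask) * noised

    # The loss is computed only on the frames the model has to generate.
    pred = model(model_input, timestep)
    loss = (((pred - noise) ** 2) * (1 - mask)).mean()
    return loss
```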