Generating temporally coherent, high-fidelity video is an important milestone in generative modeling research. We make progress towards this milestone by proposing a diffusion model for video generation that shows very promising initial results. Our model is a natural extension of the standard image diffusion architecture, and it enables joint training on image and video data, which we find reduces the variance of minibatch gradients and speeds up optimization. To generate long and higher-resolution videos we introduce a new conditional sampling technique for spatial and temporal video extension that performs better than previously proposed methods. We present the first results on a large text-conditioned video generation task, as well as state-of-the-art results on established benchmarks for video prediction and unconditional video generation. Supplementary material is available at https://video-diffusion.github.io/
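As a rough illustration of the joint image-and-video training mentioned above, the sketch below shows one plausible way a mixed minibatch could be assembled: independent images are appended to each video clip as extra "frames", together with a temporal-attention mask that keeps them from being treated as part of the clip. The function name `make_joint_batch`, the array shapes, and the masking scheme are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch of mixed image/video minibatches for diffusion training.
# All names, shapes, and the masking convention are illustrative assumptions.
import numpy as np

def make_joint_batch(videos, images):
    """videos: (B, T, H, W, C) clips; images: (B, N, H, W, C) independent frames.

    Appends the images to each clip along the time axis and returns a
    temporal-attention mask that is 1 for real video frames and 0 for the
    appended image "frames", so the latter can be modeled as independent images.
    """
    B, T, H, W, C = videos.shape
    N = images.shape[1]
    batch = np.concatenate([videos, images], axis=1)        # (B, T+N, H, W, C)
    temporal_mask = np.concatenate(
        [np.ones((B, T)), np.zeros((B, N))], axis=1)        # (B, T+N)
    return batch, temporal_mask

# Example: 2 clips of 8 frames, each padded with 4 independent images.
videos = np.random.rand(2, 8, 16, 16, 3)
images = np.random.rand(2, 4, 16, 16, 3)
batch, mask = make_joint_batch(videos, images)
print(batch.shape, mask.shape)  # (2, 12, 16, 16, 3) (2, 12)
```

Under this kind of scheme, the image "frames" contribute ordinary single-image denoising gradients to every optimization step, which is one way mixing image data into video batches could reduce minibatch gradient variance.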