A diffusion probabilistic model (DPM), which constructs a forward diffusion process by gradually adding noise to data points and learns the reverse denoising process to generate new samples, has been shown to handle complex data distributions. Despite its recent success in image synthesis, applying DPMs to video generation remains challenging due to the high-dimensional data space. Previous methods usually adopt a standard diffusion process, in which frames of the same video clip are destroyed with independent noises, ignoring content redundancy and temporal correlation. This work presents a decomposed diffusion process that resolves the per-frame noise into a base noise shared among all frames and a residual noise that varies along the time axis. The denoising pipeline employs two jointly learned networks to match this noise decomposition. Experiments on various datasets confirm that our approach, termed VideoFusion, surpasses both GAN-based and diffusion-based alternatives in high-quality video generation. We further show that our decomposed formulation can benefit from pre-trained image diffusion models and readily supports text-conditioned video creation.
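To make the noise decomposition concrete, the sketch below shows one way the correlated forward noising could be implemented: each frame's noise is a variance-preserving mix of a base noise shared across all frames and a frame-specific residual noise. The function names, the fixed mixing ratio `lam`, and the simple DDPM-style forward step are illustrative assumptions, not the paper's exact parameterization.

```python
import torch

def decomposed_noise(num_frames, frame_shape, lam=0.5, generator=None):
    """Sample per-frame noise as a mix of shared base noise and per-frame
    residual noise (a sketch; `lam` and the fixed sharing scheme are
    illustrative assumptions)."""
    base = torch.randn(1, *frame_shape, generator=generator)            # shared among all frames
    residual = torch.randn(num_frames, *frame_shape, generator=generator)  # varies along time
    # eps_i = sqrt(lam) * base + sqrt(1 - lam) * residual_i keeps unit variance per frame.
    return (lam ** 0.5) * base + ((1.0 - lam) ** 0.5) * residual

def noisy_frames(x0, t, alphas_cumprod, lam=0.5):
    """Standard q(x_t | x_0) forward step applied per frame, but driven by the
    correlated noise above instead of fully independent per-frame noise."""
    eps = decomposed_noise(x0.shape[0], x0.shape[1:], lam=lam)
    a_bar = alphas_cumprod[t]
    x_t = (a_bar ** 0.5) * x0 + ((1.0 - a_bar) ** 0.5) * eps
    return x_t, eps
```

Under this kind of split, one network (potentially initialized from a pre-trained image DPM) can predict the base component common to all frames, while a second, jointly learned network handles the time-varying residual, which is how the abstract's claim about reusing image diffusion models would be realized.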