A diffusion probabilistic model (DPM), which constructs a forward diffusion process by gradually adding noise to data points and learns the reverse denoising process to generate new samples, has been shown to handle complex data distributions. Despite its recent success in image synthesis, applying DPMs to video generation remains challenging due to the high-dimensional data space. Previous methods usually adopt a standard diffusion process, in which frames of the same video clip are corrupted with independent noises, ignoring content redundancy and temporal correlation. This work presents a decomposed diffusion process that resolves the per-frame noise into a base noise shared among all frames and a residual noise that varies along the time axis. The denoising pipeline employs two jointly learned networks to match this noise decomposition. Experiments on various datasets confirm that our approach, termed VideoFusion, surpasses both GAN-based and diffusion-based alternatives in high-quality video generation. We further show that our decomposed formulation can benefit from pre-trained image diffusion models and well supports text-conditioned video creation.
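To make the noise decomposition concrete, below is a minimal sketch of how per-frame noise could be built from a shared base noise plus per-frame residual noise and used in a forward-diffusion step. The function names, the mixing weight `lambda_`, and the tensor shapes are illustrative assumptions for this sketch, not the paper's exact formulation or schedule.

```python
# Sketch of the decomposed noise construction described in the abstract.
# `lambda_` and the shapes are assumptions, not the paper's exact setup.
import torch

def decomposed_noise(num_frames, channels, height, width, lambda_=0.5):
    """Mix a base noise shared by all frames with per-frame residual noise,
    so the noised frames remain temporally correlated."""
    # Base noise: sampled once, broadcast to every frame in the clip.
    base = torch.randn(1, channels, height, width)
    # Residual noise: sampled independently for each frame.
    residual = torch.randn(num_frames, channels, height, width)
    # Weighted mix keeps each frame's noise approximately unit-variance Gaussian.
    return (lambda_ ** 0.5) * base + ((1 - lambda_) ** 0.5) * residual

def diffuse_clip(x0, alpha_bar_t, lambda_=0.5):
    """One forward step q(x_t | x_0) for a clip x0 of shape (frames, C, H, W),
    using the decomposed noise instead of fully independent per-frame noise."""
    noise = decomposed_noise(*x0.shape, lambda_=lambda_)
    xt = (alpha_bar_t ** 0.5) * x0 + ((1 - alpha_bar_t) ** 0.5) * noise
    return xt, noise
```

In this sketch, the base component captures the content shared across frames, while the residual component carries frame-specific variation; the denoising side would use two networks to predict the two components, matching the decomposition described above.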