AI-generated content has attracted considerable attention recently, but photo-realistic video synthesis remains challenging. Although many attempts using GANs and autoregressive models have been made in this area, the visual quality and length of generated videos are far from satisfactory. Diffusion models (DMs) are another class of deep generative models and have recently achieved remarkable performance on various image synthesis tasks. However, training image diffusion models usually requires substantial computational resources to achieve high performance, which makes extending diffusion models to high-dimensional video synthesis even more computationally expensive. To ease this problem while leveraging their advantages, we introduce lightweight video diffusion models that synthesize high-fidelity, arbitrarily long videos from pure noise. Specifically, we propose to perform diffusion and denoising in a low-dimensional 3D latent space, which significantly outperforms previous pixel-space methods under a limited computational budget. In addition, although trained on only tens of frames, our models can generate videos of arbitrary length, i.e., thousands of frames, in an autoregressive manner. Finally, conditional latent perturbation is introduced to reduce the performance degradation that accumulates when synthesizing long-duration videos. Extensive experiments on various datasets and generation lengths suggest that our framework samples substantially more realistic and longer videos than previous approaches, including GAN-based, autoregressive, and diffusion-based methods.
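To make the abstract's two core ideas concrete, the following is a minimal, illustrative sketch of denoising in a compact 3D latent space, extended autoregressively with conditional latent perturbation. It is not the paper's implementation: `TinyDenoiser`, the toy update rule, and the `sigma_cond` parameter are hypothetical stand-ins chosen only to show the control flow under these assumptions.

```python
import torch

# Hypothetical denoiser (NOT the paper's architecture): predicts a clean
# latent clip from a noisy one, conditioned on the latents of the
# previously generated clip, here via simple channel concatenation.
class TinyDenoiser(torch.nn.Module):
    def __init__(self, ch=4):
        super().__init__()
        self.net = torch.nn.Conv3d(2 * ch, ch, kernel_size=3, padding=1)

    def forward(self, noisy_latent, cond_latent, t):
        # t (the diffusion timestep) is ignored in this toy module.
        return self.net(torch.cat([noisy_latent, cond_latent], dim=1))

@torch.no_grad()
def sample_clip(model, cond_latent, steps=50, sigma_cond=0.1):
    """One autoregressive step: denoise a new latent clip conditioned on
    the previous clip. sigma_cond implements conditional latent
    perturbation (assumed form): small noise added to the conditioning
    latents so that sampling-time conditions resemble the noisy
    conditions seen in training, limiting error accumulation."""
    x = torch.randn_like(cond_latent)  # start from pure noise
    cond = cond_latent + sigma_cond * torch.randn_like(cond_latent)
    for i in reversed(range(steps)):
        t = torch.full((x.shape[0],), i, dtype=torch.long)
        x0_hat = model(x, cond, t)          # predicted clean latent
        alpha = i / steps                   # toy schedule, for shape flow only
        x = alpha * x + (1 - alpha) * x0_hat
    return x

# Generate an arbitrarily long latent video, clip by clip; a 3D decoder
# (omitted here) would map the latents back to pixel space.
model = TinyDenoiser()
clips = [torch.randn(1, 4, 8, 16, 16)]     # seed latent clip (B, c, t, h, w)
for _ in range(4):                          # 4 more clips -> 40 latent frames
    clips.append(sample_clip(model, clips[-1]))
latent_video = torch.cat(clips, dim=2)
print(latent_video.shape)                   # torch.Size([1, 4, 40, 16, 16])
```

The design point the sketch highlights is the `sigma_cond` perturbation: by noising the conditioning latents at sampling time, each autoregressive step becomes more tolerant of imperfections in the previously generated clip, which is what the abstract credits with reducing quality degradation over long rollouts.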