AI-generated content has recently attracted considerable attention, but photo-realistic video synthesis remains challenging. Although many attempts have been made in this area using GANs and autoregressive models, the visual quality and length of the generated videos are far from satisfactory. Diffusion models have recently shown remarkable results but require significant computational resources. To address this, we introduce lightweight video diffusion models that leverage a low-dimensional 3D latent space, significantly outperforming previous pixel-space video diffusion models under a limited computational budget. In addition, we propose hierarchical diffusion in the latent space, enabling the generation of longer videos with more than one thousand frames. To further overcome the performance degradation in long video generation, we propose conditional latent perturbation and unconditional guidance, which effectively mitigate the errors that accumulate as the video length is extended. Extensive experiments on small-domain datasets of different categories show that our framework generates more realistic and longer videos than previous strong baselines. We additionally provide an extension to large-scale text-to-video generation to demonstrate the superiority of our work. Our code and models will be made publicly available.
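To make the latent-space idea concrete, the sketch below illustrates a generic DDPM-style denoising loop run over a compact 3D video latent of shape (B, C, T, H, W) instead of raw pixels. Everything here is a hypothetical placeholder for illustration under that assumption: `TinyUNet3D`, the linear noise schedule, and the latent shape are not the architecture or hyperparameters of the paper, and the hierarchical long-video scheme, latent perturbation, and unconditional guidance are omitted.

```python
# Illustrative sketch only: a toy denoising loop over a low-dimensional 3D video
# latent, loosely following the general latent-diffusion recipe the abstract
# describes. The denoiser, noise schedule, and latent shape are placeholders,
# not the authors' actual model.
import torch
import torch.nn as nn


class TinyUNet3D(nn.Module):
    """Placeholder denoiser over a (B, C, T, H, W) video latent."""

    def __init__(self, channels: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(channels, 64, kernel_size=3, padding=1), nn.SiLU(),
            nn.Conv3d(64, channels, kernel_size=3, padding=1),
        )

    def forward(self, z_t, t):
        # A real model would condition on the timestep t (and on text for T2V).
        return self.net(z_t)


@torch.no_grad()
def sample_latent(model, shape, steps: int = 50):
    """DDPM-style ancestral sampling in the 3D latent space (toy schedule)."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    z = torch.randn(shape)  # start from Gaussian noise in the latent space
    for t in reversed(range(steps)):
        eps = model(z, t)  # predicted noise at step t
        a_t, ab_t = alphas[t], alpha_bars[t]
        # Posterior mean of the reverse step given the predicted noise.
        z = (z - (1.0 - a_t) / (1.0 - ab_t).sqrt() * eps) / a_t.sqrt()
        if t > 0:
            z = z + betas[t].sqrt() * torch.randn_like(z)
    return z  # a 3D autoencoder decoder would map this latent back to pixels


if __name__ == "__main__":
    model = TinyUNet3D()
    # e.g. a 16-frame clip compressed to a 4 x 16 x 32 x 32 latent
    z0 = sample_latent(model, shape=(1, 4, 16, 32, 32))
    print(z0.shape)
```

The point of the sketch is the shape of the computation: because the diffusion runs over a latent that is much smaller than the pixel volume, each denoising step is correspondingly cheaper, which is what allows the lightweight models and longer videos described above.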