State-of-the-art video generative models typically learn the distribution of video latents in the VAE space and map them to pixels using a VAE decoder. While this approach can generate high-quality videos, it suffers from slow convergence and is computationally expensive when generating long videos. In this paper, we introduce SemanticGen, a novel solution to address these limitations by generating videos in the semantic space. Our main insight is that, due to the inherent redundancy in videos, the generation process should begin in a compact, high-level semantic space for global planning, followed by the addition of high-frequency details, rather than directly modeling a vast set of low-level video tokens using bi-directional attention. SemanticGen adopts a two-stage generation process. In the first stage, a diffusion model generates compact semantic video features, which define the global layout of the video. In the second stage, another diffusion model generates VAE latents conditioned on these semantic features to produce the final output. We observe that generation in the semantic space leads to faster convergence compared to the VAE latent space. Our method is also effective and computationally efficient when extended to long video generation. Extensive experiments demonstrate that SemanticGen produces high-quality videos and outperforms state-of-the-art approaches and strong baselines.
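To make the two-stage pipeline concrete, below is a minimal PyTorch-style sketch of the generation flow described above. It is an illustrative assumption, not the authors' released implementation: the `semantic_diffusion`, `latent_diffusion`, and `vae_decoder` modules, and the `sample(cond=..., steps=...)` interface they expose, are hypothetical names introduced only to show how the semantic-space plan conditions the latent-space refinement.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the two-stage pipeline: stage 1 denoises in a compact
# semantic space (global planning), stage 2 denoises VAE latents conditioned on
# that plan (high-frequency detail), and a VAE decoder maps latents to pixels.
# Module names and the .sample interface are assumptions for illustration.

class TwoStageVideoGenerator(nn.Module):
    def __init__(self,
                 semantic_diffusion: nn.Module,   # stage-1 diffusion model
                 latent_diffusion: nn.Module,     # stage-2 diffusion model
                 vae_decoder: nn.Module):         # maps VAE latents to frames
        super().__init__()
        self.semantic_diffusion = semantic_diffusion
        self.latent_diffusion = latent_diffusion
        self.vae_decoder = vae_decoder

    @torch.no_grad()
    def sample(self, text_emb: torch.Tensor, num_steps: int = 50) -> torch.Tensor:
        # Stage 1: generate compact semantic features that lay out the video globally.
        semantic_feats = self.semantic_diffusion.sample(cond=text_emb, steps=num_steps)
        # Stage 2: generate VAE latents conditioned on the semantic plan.
        vae_latents = self.latent_diffusion.sample(cond=semantic_feats, steps=num_steps)
        # Decode latents into pixel-space video frames.
        return self.vae_decoder(vae_latents)
```

The key design point the sketch captures is the ordering: the second diffusion model never sees the text prompt directly in this simplified view; it only refines the much smaller semantic representation, which is what makes convergence faster and long-video extension cheaper in the semantic space.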