Modern deep learning methods typically treat image sequences as large tensors of sequentially stacked frames. However, is this straightforward representation ideal given the current state-of-the-art (SoTA)? In this work, we address this question in the context of generative models and aim to devise a more effective way of modeling image sequence data. Observing the inefficiencies and bottlenecks of current SoTA image sequence generation methods, we show that rather than working with large tensors, we can improve the generation process by factorizing it: first generating the coarse sequence at low resolution, then refining the individual frames at high resolution. We train a generative model solely on grid images composed of subsampled frames. Yet the model learns to generate image sequences, using the strong self-attention mechanism of the Diffusion Transformer (DiT) to capture correlations between frames. In effect, our formulation extends a 2D image generator to operate as a low-resolution 3D image-sequence generator without introducing any architectural modifications. Subsequently, we super-resolve each frame individually to add the sequence-independent high-resolution details. This approach offers several advantages and overcomes key limitations of the SoTA in this domain. Compared to existing image sequence generation models, our method achieves superior synthesis quality and improved coherence across sequences. It also delivers high-fidelity generation of arbitrary-length sequences and increased efficiency in inference time and training data usage. Furthermore, our straightforward formulation enables our method to generalize effectively across diverse data domains that typically require additional priors and supervision to model in a generative context. Our method consistently outperforms the SoTA in quality and inference speed (at least twice as fast) across datasets.
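The grid-image factorization described above can be sketched as a pair of lossless packing operations plus a two-stage sampling pipeline. In this sketch, `grid_model` and `sr_model` are hypothetical stand-ins for the trained 2D DiT grid generator and the per-frame super-resolution model; the packing/unpacking logic itself is a generic tensor rearrangement, not the paper's actual interface:

```python
import numpy as np

def frames_to_grid(frames, rows, cols):
    """Pack a sequence of low-res frames (T, H, W, C) into one 2D grid image."""
    T, H, W, C = frames.shape
    assert T == rows * cols, "sequence length must fill the grid exactly"
    grid = frames.reshape(rows, cols, H, W, C)
    # Interleave grid-row and pixel-row axes so frames tile spatially.
    return grid.transpose(0, 2, 1, 3, 4).reshape(rows * H, cols * W, C)

def grid_to_frames(grid, rows, cols):
    """Inverse of frames_to_grid: split a grid image back into frames."""
    GH, GW, C = grid.shape
    H, W = GH // rows, GW // cols
    frames = grid.reshape(rows, H, cols, W, C)
    return frames.transpose(0, 2, 1, 3, 4).reshape(rows * cols, H, W, C)

def generate_sequence(grid_model, sr_model, rows, cols):
    """Two-stage sampling sketch (hypothetical model interfaces).

    Stage 1: a 2D generator samples one grid image; its self-attention
    spans all frames at once, capturing temporal correlations.
    Stage 2: each low-res frame is super-resolved independently.
    """
    grid = grid_model()                       # sample one (rows*H, cols*W, C) image
    low_res = grid_to_frames(grid, rows, cols)
    return np.stack([sr_model(f) for f in low_res])
```

Because the grid packing is a pure rearrangement, `grid_to_frames(frames_to_grid(x, r, c), r, c)` recovers `x` exactly, so no information is lost between the sequence view and the single-image view the generator is trained on.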