To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator. Despite their promising results, this paradigm is computationally expensive. In this work, we propose a new T2V generation setting, One-Shot Video Tuning, in which only a single text-video pair is presented. Our model builds on state-of-the-art T2I diffusion models pre-trained on massive image data. We make two key observations: 1) T2I models can generate still images that represent verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we introduce Tune-A-Video, which involves a tailored spatio-temporal attention mechanism and an efficient one-shot tuning strategy. At inference, we employ DDIM inversion to provide structure guidance for sampling. Extensive qualitative and numerical experiments demonstrate the remarkable ability of our method across various applications.
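The abstract does not spell out the spatio-temporal attention design, so the following is only a minimal PyTorch sketch of one plausible variant: each frame's queries attend to keys/values drawn from the first frame and the immediately preceding frame, which keeps per-frame content anchored across time. The module name `SpatioTemporalAttention`, the tensor layout `(batch, frames, tokens, dim)`, and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatioTemporalAttention(nn.Module):
    """Sketch: queries of frame i attend to keys/values of frame 0 and frame i-1.
    Shapes and names are assumptions for illustration, not the paper's code."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim) -- latent patches of each video frame
        b, f, n, d = x.shape
        q = self.to_q(x)                                # queries from every frame
        first = x[:, :1].expand(-1, f, -1, -1)          # frame 0, broadcast to all frames
        prev = torch.cat([x[:, :1], x[:, :-1]], dim=1)  # frame i-1 (frame 0 for i=0)
        kv_src = torch.cat([first, prev], dim=2)        # (b, f, 2n, d)
        k, v = self.to_k(kv_src), self.to_v(kv_src)

        def split(t: torch.Tensor) -> torch.Tensor:
            # (b, f, tokens, dim) -> (b*f, heads, tokens, head_dim)
            return t.reshape(b * f, -1, self.heads, d // self.heads).transpose(1, 2)

        out = F.scaled_dot_product_attention(split(q), split(k), split(v))
        out = out.transpose(1, 2).reshape(b, f, n, d)
        return self.to_out(out)


# Usage: 2 videos, 8 frames, 64 latent tokens per frame, 320-dim features
attn = SpatioTemporalAttention(dim=320, heads=8)
video_tokens = torch.randn(2, 8, 64, 320)
print(attn(video_tokens).shape)  # torch.Size([2, 8, 64, 320])
```

Restricting each frame to a small, fixed set of reference frames (rather than full all-frame attention) keeps the cost close to the original frame-wise self-attention while still propagating appearance across time, which is consistent with the efficiency emphasis of the one-shot setting.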