To reproduce the success of text-to-image (T2I) generation, recent works in text-to-video (T2V) generation employ large-scale text-video datasets for fine-tuning. However, such a paradigm is computationally expensive. Humans have the amazing ability to learn new visual concepts from a single exemplar. We hereby study a new T2V generation problem$\unicode{x2014}$One-Shot Video Generation, where only a single text-video pair is presented for training an open-domain T2V generator. Intuitively, we propose to adapt the T2I diffusion model pretrained on massive image data for T2V generation. We make two key observations: 1) T2I models can generate images that align well with the verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we propose Tune-A-Video with a tailored Sparse-Causal Attention, which generates videos from text prompts via an efficient one-shot tuning of pretrained T2I diffusion models. Tune-A-Video is capable of producing temporally-coherent videos across various applications such as change of subject or background, attribute editing, and style transfer, demonstrating the versatility and effectiveness of our method.
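To make the Sparse-Causal Attention idea referenced above more concrete, the following PyTorch sketch illustrates one plausible reading of it: tokens of each frame query only the tokens of the first frame and of the immediately preceding frame, rather than all frames. The module name `SparseCausalAttentionSketch`, the tensor layout, and the hyperparameters are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseCausalAttentionSketch(nn.Module):
    """Sketch: each frame attends only to frame 0 and its previous frame."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim) -- latent patch tokens per frame
        b, f, n, d = x.shape
        q = self.to_q(x)                                # queries from the current frame
        first = x[:, :1].expand(-1, f, -1, -1)          # frame 0, broadcast to every frame
        prev = torch.cat([x[:, :1], x[:, :-1]], dim=1)  # frame i-1 (frame 0 for i = 0)
        kv_src = torch.cat([first, prev], dim=2)        # keys/values: [frame 0 ; frame i-1]
        k, v = self.to_k(kv_src), self.to_v(kv_src)

        # reshape to (batch*frames, heads, tokens, head_dim) and run attention
        def split_heads(t: torch.Tensor) -> torch.Tensor:
            return t.reshape(b * f, -1, self.heads, d // self.heads).transpose(1, 2)

        q, k, v = map(split_heads, (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)   # softmax(QK^T / sqrt(d_h)) V
        out = out.transpose(1, 2).reshape(b, f, n, d)
        return self.to_out(out)


# Hypothetical usage: 8 frames, 64 spatial tokens per frame, 320-dim features.
attn = SparseCausalAttentionSketch(dim=320, heads=8)
video_tokens = torch.randn(1, 8, 64, 320)
out = attn(video_tokens)  # (1, 8, 64, 320)
```

Compared with full spatio-temporal attention over all frames, this sparse pattern keeps the key/value set at a constant two frames per query, which is what makes one-shot tuning on a single video computationally feasible while still propagating content from the first frame.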