To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator. Despite their promising results, this paradigm is computationally expensive. In this work, we propose a new T2V generation setting, One-Shot Video Tuning, in which only a single text-video pair is presented. Our model builds on state-of-the-art T2I diffusion models pre-trained on massive image data. We make two key observations: 1) T2I models can generate still images that represent verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we introduce Tune-A-Video, which involves a tailored spatio-temporal attention mechanism and an efficient one-shot tuning strategy. At inference, we employ DDIM inversion to provide structure guidance for sampling. Extensive qualitative and numerical experiments demonstrate the remarkable ability of our method across various applications.
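The abstract does not spell out the spatio-temporal attention design, so the following is only a minimal PyTorch sketch of one plausible variant: each frame's queries attend to keys/values drawn from the first frame and the immediately preceding frame, which keeps per-frame content anchored across time. The module name `SpatioTemporalAttention`, the tensor layout `(batch, frames, tokens, dim)`, and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatioTemporalAttention(nn.Module):
    """Sketch: queries of frame i attend to keys/values of frame 0 and frame i-1.
    Shapes and names are assumptions for illustration, not the paper's code."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim) -- latent patches of each video frame
        b, f, n, d = x.shape
        q = self.to_q(x)                                # queries from every frame
        first = x[:, :1].expand(-1, f, -1, -1)          # frame 0, broadcast to all frames
        prev = torch.cat([x[:, :1], x[:, :-1]], dim=1)  # frame i-1 (frame 0 for i=0)
        kv_src = torch.cat([first, prev], dim=2)        # (b, f, 2n, d)
        k, v = self.to_k(kv_src), self.to_v(kv_src)

        def split(t: torch.Tensor) -> torch.Tensor:
            # (b, f, tokens, dim) -> (b*f, heads, tokens, head_dim)
            return t.reshape(b * f, -1, self.heads, d // self.heads).transpose(1, 2)

        out = F.scaled_dot_product_attention(split(q), split(k), split(v))
        out = out.transpose(1, 2).reshape(b, f, n, d)
        return self.to_out(out)


# Usage: 2 videos, 8 frames, 64 latent tokens per frame, 320-dim features
attn = SpatioTemporalAttention(dim=320, heads=8)
video_tokens = torch.randn(2, 8, 64, 320)
print(attn(video_tokens).shape)  # torch.Size([2, 8, 64, 320])
```

Restricting each frame to a small, fixed set of reference frames (rather than full all-frame attention) keeps the cost close to the original frame-wise self-attention while still propagating appearance across time, which is consistent with the efficiency emphasis of the one-shot setting.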