Image diffusion models, trained on massive image collections, have emerged as the most versatile image generators in terms of quality and diversity. They support inverting real images and conditional (e.g., text) generation, making them attractive for high-quality image editing applications. We investigate how to use such pre-trained image models for text-guided video editing. The critical challenge is to achieve the target edits while still preserving the content of the source video. Our method works in two simple steps: first, we use a pre-trained structure-guided (e.g., depth) image diffusion model to perform text-guided edits on an anchor frame; then, in the key step, we progressively propagate the changes to future frames via self-attention feature injection to adapt the core denoising step of the diffusion model. We then consolidate the changes by adjusting the latent code for the frame before continuing the process. Our approach is training-free and generalizes to a wide range of edits. We demonstrate the effectiveness of the approach through extensive experimentation and comparison against four different prior and parallel efforts (on arXiv), showing that realistic text-guided video edits are possible without any compute-intensive preprocessing or video-specific finetuning.
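To make the two-step procedure concrete, the following is a minimal Python sketch of the overall control flow: edit an anchor frame with a structure-guided (depth-conditioned) image diffusion model, then propagate the edit to later frames by injecting self-attention features into each denoising step and adjusting the latent code before moving on. Every helper name here (`estimate_depth`, `edit_with_depth_guidance`, `init_latent`, `denoise_step_with_injection`, `adjust_latent`, `decode`) is a hypothetical placeholder introduced for illustration; this is not the authors' implementation or a real library API.

```python
# Illustrative sketch only; all helpers are hypothetical placeholders,
# not the paper's implementation or a real library API.
from typing import Any, List


def edit_video(frames: List[Any], prompt: str, num_denoise_steps: int = 50) -> List[Any]:
    """High-level sketch of training-free, text-guided video editing in two steps."""
    # Step 1: edit the anchor (first) frame with a structure-guided
    # (e.g., depth-conditioned) image diffusion model.
    anchor_depth = estimate_depth(frames[0])                      # hypothetical helper
    edited_anchor, prev_features = edit_with_depth_guidance(
        frames[0], anchor_depth, prompt)                          # hypothetical helper

    edited_frames = [edited_anchor]

    # Step 2: progressively propagate the edit to future frames.
    for frame in frames[1:]:
        depth = estimate_depth(frame)                             # hypothetical helper
        latent = init_latent(frame)                               # hypothetical (e.g., inversion)
        features = prev_features
        for t in reversed(range(num_denoise_steps)):
            # Key step: inject self-attention features from the previously
            # edited frame into this frame's denoising step, adapting the
            # core denoising of the diffusion model for temporal consistency.
            latent, features = denoise_step_with_injection(
                latent, t, prompt, depth,
                injected_features=prev_features)                  # hypothetical helper
        # Consolidate the changes by adjusting the latent code for this
        # frame before continuing to the next one.
        latent = adjust_latent(latent, prev_features)             # hypothetical helper
        edited_frames.append(decode(latent))                      # hypothetical helper
        prev_features = features
    return edited_frames
```

The sketch is only meant to show where the anchor-frame edit, the per-step feature injection, and the latent adjustment sit relative to each other; the actual conditioning, inversion, and attention-injection details depend on the underlying diffusion model.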