Pix2Video: Video Editing using Image Diffusion

Image diffusion models, trained on massive image collections, have emerged as the most versatile image generators in terms of quality and diversity. They support inverting real images and conditional (e.g., text) generation, making them attractive for high-quality image editing applications. We investigate how to use such pre-trained image models for text-guided video editing. The critical challenge is to achieve the target edits while still preserving the content of the source video. Our method works in two simple steps: first, we use a pre-trained structure-guided (e.g., depth) image diffusion model to perform text-guided edits on an anchor frame; then, in the key step, we progressively propagate the changes to future frames via self-attention feature injection, which adapts the core denoising step of the diffusion model. We then consolidate the changes by adjusting the latent code for the frame before continuing the process. Our approach is training-free and generalizes to a wide range of edits. We demonstrate the effectiveness of the approach through extensive experimentation and compare it against four different prior and concurrent efforts (on arXiv). We show that realistic text-guided video edits are possible without any compute-intensive preprocessing or video-specific finetuning.
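The propagation step can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes that "self-attention feature injection" means the current frame supplies the queries while the keys and values are taken from the features of the anchor frame and the previous frame. The function names (`attention`, `injected_self_attention`) are hypothetical, and the learned query/key/value projections of a real diffusion U-Net are replaced by the identity to keep the example small.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def injected_self_attention(current, anchor, previous):
    """Assumed injection scheme: queries come from the current frame,
    keys and values come from the anchor and previous frames."""
    context = anchor + previous  # concatenate tokens of both reference frames
    return attention(current, context, context)


if __name__ == "__main__":
    # Toy features: each frame has a few 2-dimensional tokens.
    anchor = [[1.0, 0.0], [0.0, 1.0]]
    previous = [[0.5, 0.5]]
    current = [[2.0, 0.0], [0.0, 2.0]]
    for token in injected_self_attention(current, anchor, previous):
        print([round(x, 3) for x in token])
```

Because every output token is a convex combination of anchor and previous-frame features, the edited appearance of those frames is carried into the current frame, which is the intuition behind propagating an edit without any training.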

