Large-scale text-to-image diffusion models have achieved unprecedented success in image generation and editing. However, how to extend this success to video editing remains unclear. Recent initial attempts at video editing require substantial text-to-video data and computational resources for training, which are often inaccessible. In this work, we propose vid2vid-zero, a simple yet effective method for zero-shot video editing. vid2vid-zero leverages off-the-shelf image diffusion models and does not require training on any video. At the core of our method are a null-text inversion module for text-to-video alignment, a cross-frame modeling module for temporal consistency, and a spatial regularization module for fidelity to the original video. Without any training, we leverage the dynamic nature of the attention mechanism to enable bi-directional temporal modeling at test time. Experiments and analyses show promising results in editing attributes, subjects, places, etc., in real-world videos. Code will be made available at \url{https://github.com/baaivision/vid2vid-zero}.
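To make the test-time cross-frame idea concrete, below is a minimal sketch (not the authors' code) of bi-directional cross-frame attention: queries from each frame attend to keys and values gathered from all frames of the clip, reusing the image model's existing self-attention projections. The projection names (to_q, to_k, to_v) follow common diffusers-style conventions and are assumptions here.

\begin{verbatim}
# Hypothetical sketch of bi-directional cross-frame attention at test time.
import torch
import torch.nn.functional as F

def cross_frame_attention(hidden_states, to_q, to_k, to_v, num_frames, num_heads):
    """hidden_states: (batch * num_frames, seq_len, dim) per-frame tokens."""
    bf, seq_len, dim = hidden_states.shape
    batch = bf // num_frames
    head_dim = dim // num_heads

    q = to_q(hidden_states)  # queries stay per-frame
    k = to_k(hidden_states)
    v = to_v(hidden_states)

    # Gather keys/values from every frame so attention is bi-directional in time.
    k = k.reshape(batch, num_frames * seq_len, dim).repeat_interleave(num_frames, dim=0)
    v = v.reshape(batch, num_frames * seq_len, dim).repeat_interleave(num_frames, dim=0)

    def split_heads(x):
        return x.reshape(x.shape[0], -1, num_heads, head_dim).transpose(1, 2)

    out = F.scaled_dot_product_attention(split_heads(q), split_heads(k), split_heads(v))
    return out.transpose(1, 2).reshape(bf, seq_len, dim)
\end{verbatim}

Because the attention operator accepts key/value sequences of arbitrary length, the pretrained per-frame self-attention weights can be reused unchanged; only the tokens fed to the key/value projections are expanded across frames.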