Diffusion-based generative models have achieved remarkable success in text-based image generation. However, since the generation process involves enormous randomness, it remains challenging to apply such models to real-world visual content editing, especially in videos. In this paper, we propose FateZero, a zero-shot text-based editing method for real-world videos that requires neither per-prompt training nor user-specified masks. To edit videos consistently, we propose several techniques built on pre-trained models. First, in contrast to straightforward DDIM inversion, our approach captures intermediate attention maps during inversion, which effectively retain both structural and motion information. These maps are fused directly in the editing process rather than regenerated during denoising. To further minimize semantic leakage from the source video, we then fuse the self-attention maps using a blending mask obtained from the cross-attention features of the source prompt. Furthermore, we reform the self-attention mechanism in the denoising UNet by introducing spatial-temporal attention to ensure frame consistency. Though succinct, our method is the first to demonstrate zero-shot text-driven video style and local attribute editing using a pre-trained text-to-image model. It also achieves better zero-shot shape-aware editing when built on a pre-trained text-to-video model. Extensive experiments demonstrate superior temporal consistency and editing capability compared with previous works.
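To illustrate the attention-fusion idea described above, the following is a minimal sketch (not the authors' implementation) of how self-attention maps stored during DDIM inversion might be blended with those computed during editing, using a mask derived from the source prompt's cross-attention. All function and parameter names (`blend_self_attention`, `token_idx`, `tau`) are hypothetical, and the tensor shapes are assumed for illustration only.

```python
import torch

def blend_self_attention(attn_src, attn_edit, cross_attn_src, token_idx, tau=0.3):
    """Hypothetical sketch of attention blending with a cross-attention mask.

    attn_src:       self-attention maps stored during DDIM inversion of the
                    source video, shape (heads, seq_len, seq_len)
    attn_edit:      self-attention maps computed during editing/denoising,
                    same shape as attn_src
    cross_attn_src: cross-attention maps from the source prompt,
                    shape (heads, seq_len, num_tokens)
    token_idx:      indices of the edited word(s) in the source prompt
    tau:            threshold turning averaged cross-attention into a binary mask
    """
    # Average cross-attention over heads and the selected tokens, then
    # threshold to obtain a spatial mask of the region to be edited.
    mask = cross_attn_src[..., token_idx].mean(dim=(0, -1))   # (seq_len,)
    mask = (mask / mask.max() > tau).float()                  # binary mask

    # Inside the mask: keep the newly generated attention (the edit).
    # Outside the mask: reuse the inverted source attention to preserve
    # the structure and motion of the original video.
    mask = mask.view(1, -1, 1)                                # broadcast over rows
    return mask * attn_edit + (1.0 - mask) * attn_src
```

Under these assumptions, the blended maps would replace the self-attention maps in the denoising UNet at each timestep, so edits are confined to the masked region while the rest of the frame follows the source video's attention.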