This paper presents Video-P2P, a novel framework for real-world video editing with cross-attention control. While attention control has proven effective for image editing with pre-trained image generation models, there are currently no large-scale video generation models publicly available. Video-P2P addresses this limitation by adapting an image generation diffusion model to complete various video editing tasks. Specifically, we propose to first tune a Text-to-Set (T2S) model to complete an approximate inversion and then optimize a shared unconditional embedding to achieve accurate video inversion with a small memory cost. For attention control, we introduce a novel decoupled-guidance strategy, which uses different guidance strategies for the source and target prompts. The optimized unconditional embedding for the source prompt improves reconstruction ability, while an initialized unconditional embedding for the target prompt enhances editability. Incorporating the attention maps of these two branches enables detailed editing. These technical designs enable various text-driven editing applications, including word swap, prompt refinement, and attention re-weighting. Video-P2P works well on real-world videos for generating new characters while optimally preserving their original poses and scenes. It significantly outperforms previous approaches.
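To make the decoupled-guidance idea concrete, the following is a minimal PyTorch-style sketch of one classifier-free-guidance denoising step in which the source branch uses the optimized unconditional embedding while the target branch uses a freshly initialized one. The `unet` callable, argument names, and tensor shapes are illustrative assumptions for exposition, not the authors' implementation.

```python
import torch

def decoupled_guidance_step(unet, z_src, z_tgt, t, src_cond, tgt_cond,
                            src_null_opt, tgt_null_init, scale=7.5):
    """One classifier-free-guidance denoising step with decoupled guidance.

    `unet` is assumed to be any noise-prediction network taking
    (latents, timestep, text_embedding); all names here are
    hypothetical, not the paper's API.
    """
    # Source branch: the *optimized* unconditional embedding,
    # which favors faithful reconstruction of the original video.
    eps_null = unet(z_src, t, src_null_opt)
    eps_cond = unet(z_src, t, src_cond)
    eps_src = eps_null + scale * (eps_cond - eps_null)

    # Target branch: the *initialized* (default) unconditional
    # embedding, which favors editability under the target prompt.
    eps_null = unet(z_tgt, t, tgt_null_init)
    eps_cond = unet(z_tgt, t, tgt_cond)
    eps_tgt = eps_null + scale * (eps_cond - eps_null)

    # Video-P2P additionally combines the cross-attention maps of the
    # two branches (Prompt-to-Prompt style) during these forward passes
    # to localize the edit; that injection is omitted here for brevity.
    return eps_src, eps_tgt

if __name__ == "__main__":
    # Shape check with a toy "unet" (ignores timestep and conditioning).
    toy_unet = lambda z, t, emb: z * 0.1
    z = torch.randn(1, 4, 8, 64, 64)  # (batch, channels, frames, h, w)
    null = torch.randn(1, 77, 768)    # CLIP-sized text embedding
    eps_s, eps_t = decoupled_guidance_step(toy_unet, z, z, 0,
                                           null, null, null, null)
```

In this sketch the two branches share latents and guidance scale; only the unconditional embedding differs, which is the essence of the decoupled-guidance strategy described above.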