Text-guided generative diffusion models unlock powerful image creation and editing tools. While these have been extended to video generation, current approaches that edit the content of existing footage while retaining structure require expensive re-training for every input or rely on error-prone propagation of image edits across frames. In this work, we present a structure and content-guided video diffusion model that edits videos based on visual or textual descriptions of the desired output. Conflicts between user-provided content edits and structure representations occur due to insufficient disentanglement between the two aspects. As a solution, we show that training on monocular depth estimates with varying levels of detail provides control over structure and content fidelity. Our model is trained jointly on images and videos, which also exposes explicit control of temporal consistency through a novel guidance method. Our experiments demonstrate a wide variety of successes: fine-grained control over output characteristics, customization based on a few reference images, and a strong user preference towards results by our model.
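To make the two mechanisms in the abstract concrete, the sketch below illustrates (a) how blurring a monocular depth estimate by varying amounts yields a structure representation with a tunable level of detail, and (b) a classifier-free-guidance-style blend between a per-frame and a temporally consistent prediction. This is a minimal illustration under stated assumptions, not the paper's implementation: the blur schedule, the `omega` blend, and the placeholder depth map are assumptions introduced here for clarity.

```python
# Minimal sketch (not the paper's implementation) of depth-based structure
# control with varying detail, and a guidance-style temporal blend.
import numpy as np
from scipy.ndimage import gaussian_filter


def structure_representation(depth: np.ndarray, detail: float) -> np.ndarray:
    """Blur a monocular depth map so that `detail` in [0, 1] controls how much
    structural detail is preserved (1.0 = full detail, 0.0 = coarse layout).
    The maximum blur strength is an assumption for illustration."""
    max_sigma = 8.0                      # strongest blur considered (assumption)
    sigma = (1.0 - detail) * max_sigma   # less detail -> more blur
    return gaussian_filter(depth, sigma=sigma)


def temporally_guided_estimate(per_frame: np.ndarray,
                               video_consistent: np.ndarray,
                               omega: float) -> np.ndarray:
    """Blend an image-model (per-frame) prediction with a video-model
    (temporally consistent) prediction; larger `omega` pushes the result
    toward temporal consistency. A sketch of the general guidance idea,
    not the paper's exact formulation."""
    return per_frame + omega * (video_consistent - per_frame)


if __name__ == "__main__":
    depth = np.random.rand(64, 64).astype(np.float32)  # placeholder depth map
    coarse = structure_representation(depth, detail=0.2)
    fine = structure_representation(depth, detail=0.9)
    print("coarse std:", coarse.std(), "fine std:", fine.std())
```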