We present a method for zero-shot, text-driven appearance manipulation in natural images and videos. Given an input image or video and a target text prompt, our goal is to edit the appearance of existing objects (e.g., object's texture) or augment the scene with visual effects (e.g., smoke, fire) in a semantically meaningful manner. We train a generator using an internal dataset of training examples, extracted from a single input (image or video and target text prompt), while leveraging an external pre-trained CLIP model to establish our losses. Rather than directly generating the edited output, our key idea is to generate an edit layer (color+opacity) that is composited over the original input. This allows us to constrain the generation process and maintain high fidelity to the original input via novel text-driven losses that are applied directly to the edit layer. Our method neither relies on a pre-trained generator nor requires user-provided edit masks. We demonstrate localized, semantic edits on high-resolution natural images and videos across a variety of objects and scenes.
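To make the layered-editing idea concrete, the sketch below illustrates the core compositing step and a generic CLIP-based text loss. It is a minimal illustration under stated assumptions (PyTorch; the function names, tensor shapes, and the specific loss are placeholders), not the paper's actual implementation or training objective.

```python
# Minimal sketch of compositing a generated edit layer over the source
# and scoring the result against a target text prompt via CLIP embeddings.
# Assumptions: the generator producing (edit_rgb, edit_alpha) and the frozen
# CLIP encoders are external; only placeholder embeddings are used here.
import torch
import torch.nn.functional as F

def composite(edit_rgb: torch.Tensor, edit_alpha: torch.Tensor,
              source: torch.Tensor) -> torch.Tensor:
    """Alpha-composite the predicted edit layer (color + opacity) over the input.

    edit_rgb:   (B, 3, H, W) edit-layer colors in [0, 1]
    edit_alpha: (B, 1, H, W) edit-layer opacity in [0, 1]
    source:     (B, 3, H, W) original image (or video frame)
    """
    return edit_alpha * edit_rgb + (1.0 - edit_alpha) * source

def clip_text_loss(image_embed: torch.Tensor, text_embed: torch.Tensor) -> torch.Tensor:
    """Cosine-distance loss between the CLIP embedding of an image (e.g., the
    composited output, or the edit layer rendered over a neutral background)
    and the CLIP embedding of the target text prompt. Both embeddings are
    assumed to come from a frozen, pre-trained CLIP model."""
    image_embed = F.normalize(image_embed, dim=-1)
    text_embed = F.normalize(text_embed, dim=-1)
    return (1.0 - (image_embed * text_embed).sum(dim=-1)).mean()

# Toy usage with random tensors standing in for the generator's outputs
# and the CLIP embeddings of the composite and the text prompt.
src = torch.rand(1, 3, 256, 256)
rgb, alpha = torch.rand(1, 3, 256, 256), torch.rand(1, 1, 256, 256)
out = composite(rgb, alpha, src)
loss = clip_text_loss(torch.randn(1, 512), torch.randn(1, 512))
```

Because losses can be applied to the edit layer itself (not only to the final composite), the generation is constrained to change only what the text requires while the untouched regions of the source pass through unchanged.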