Inspired by the ability of StyleGAN to generate highly realistic images in a variety of domains, much recent work has focused on understanding how to use the latent spaces of StyleGAN to manipulate generated and real images. However, discovering semantically meaningful latent manipulations typically involves painstaking human examination of the many degrees of freedom, or an annotated collection of images for each desired manipulation. In this work, we explore leveraging the power of recently introduced Contrastive Language-Image Pre-training (CLIP) models in order to develop a text-based interface for StyleGAN image manipulation that does not require such manual effort. We first introduce an optimization scheme that utilizes a CLIP-based loss to modify an input latent vector in response to a user-provided text prompt. Next, we describe a latent mapper that infers a text-guided latent manipulation step for a given input image, allowing faster and more stable text-based manipulation. Finally, we present a method for mapping text prompts to input-agnostic directions in StyleGAN's style space, enabling interactive text-driven image manipulation. Extensive results and comparisons demonstrate the effectiveness of our approaches.
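As a rough illustration of the first of these steps, the sketch below shows how a CLIP-based loss can drive latent-code optimization toward a text prompt. It is a minimal sketch, not the authors' implementation: the `generator` callable, the hyperparameters, and the omission of the paper's identity-preservation term are assumptions made for brevity; only OpenAI's public `clip` package and standard PyTorch calls are used.

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float()  # keep everything in fp32 for stable gradients

# CLIP's published input normalization constants.
CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

def clip_loss(image, text_tokens):
    """1 - cosine similarity between the generated image and the text prompt."""
    # Map from [-1, 1] to [0, 1], resize to CLIP's 224x224 input, and normalize.
    image = F.interpolate((image + 1) / 2, size=224, mode="bilinear", align_corners=False)
    image = (image - CLIP_MEAN) / CLIP_STD
    image_features = clip_model.encode_image(image)
    text_features = clip_model.encode_text(text_tokens)
    return 1.0 - F.cosine_similarity(image_features, text_features).mean()

def edit_latent(generator, w_init, prompt, steps=200, lr=0.1, l2_lambda=0.008):
    """Optimize a latent code w so that generator(w) matches `prompt` under CLIP,
    while an L2 penalty keeps w close to its starting point."""
    text_tokens = clip.tokenize([prompt]).to(device)
    w = w_init.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        image = generator(w)  # hypothetical pretrained StyleGAN forward pass, output in [-1, 1]
        loss = clip_loss(image, text_tokens) + l2_lambda * ((w - w_init) ** 2).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return w.detach()
```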