Can a generative model be trained to produce images from a specific domain, guided by a text prompt only, without seeing any image? In other words: can an image generator be trained blindly? Leveraging the semantic power of large-scale Contrastive Language-Image Pre-training (CLIP) models, we present a text-driven method that allows shifting a generative model to new domains, without having to collect even a single image from those domains. We show that through natural language prompts and a few minutes of training, our method can adapt a generator across a multitude of domains characterized by diverse styles and shapes. Notably, many of these modifications would be difficult or outright impossible to reach with existing methods. We conduct an extensive set of experiments and comparisons across a wide range of domains. These demonstrate the effectiveness of our approach and show that our shifted models maintain the latent-space properties that make generative models appealing for downstream tasks.
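The abstract does not spell out the training objective, so the following is a minimal, hypothetical sketch of how text-only domain shifting with CLIP could look: a frozen copy of a pretrained generator anchors the source domain, while a trainable copy is fine-tuned so that the CLIP-space change of its images follows the CLIP-space direction between a source prompt and a target prompt. All names here (adapt_generator, directional_clip_loss, the prompts, and the generator interface) are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of text-guided generator domain adaptation with CLIP.
# Assumes a latent-to-image generator callable as G(z); exact architecture,
# prompts, and hyperparameters are placeholders.
import copy
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model.eval()

def encode_text(prompt: str) -> torch.Tensor:
    tokens = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        feat = clip_model.encode_text(tokens).float()
    return F.normalize(feat, dim=-1)

def encode_image(images: torch.Tensor) -> torch.Tensor:
    # images: (N, 3, H, W); resized to CLIP's input size.
    # CLIP's mean/std normalization is omitted here for brevity.
    images = F.interpolate(images, size=224, mode="bilinear", align_corners=False)
    feat = clip_model.encode_image(images).float()
    return F.normalize(feat, dim=-1)

def directional_clip_loss(img_src, img_tgt, text_src, text_tgt):
    """Align the image-space CLIP direction with the text-space direction."""
    text_dir = F.normalize(text_tgt - text_src, dim=-1)
    img_dir = F.normalize(encode_image(img_tgt) - encode_image(img_src), dim=-1)
    return (1.0 - (img_dir * text_dir).sum(dim=-1)).mean()

def adapt_generator(pretrained_generator, source_prompt, target_prompt,
                    steps=300, lr=2e-3, z_dim=512, batch=4):
    # G_frozen stays fixed in the source domain; G_train drifts toward the
    # target domain described only by text.
    G_frozen = copy.deepcopy(pretrained_generator).eval().requires_grad_(False)
    G_train = copy.deepcopy(pretrained_generator).train()
    opt = torch.optim.Adam(G_train.parameters(), lr=lr)
    t_src, t_tgt = encode_text(source_prompt), encode_text(target_prompt)

    for _ in range(steps):
        z = torch.randn(batch, z_dim, device=device)
        with torch.no_grad():
            img_src = G_frozen(z)   # source-domain images (reference)
        img_tgt = G_train(z)        # candidate target-domain images
        loss = directional_clip_loss(img_src, img_tgt, t_src, t_tgt)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return G_train
```

Under these assumptions, a call such as adapt_generator(G, "photo", "sketch") would need no target-domain images at all, which matches the "trained blindly" setting described above; the directional form of the loss is one plausible choice, not the only one.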