Can a generative model be trained to produce images from a specific domain, guided by a text prompt only, without seeing any image? In other words: can an image generator be trained "blindly"? Leveraging the semantic power of large-scale Contrastive Language-Image Pre-training (CLIP) models, we present a text-driven method that allows shifting a generative model to new domains, without having to collect even a single image. We show that through natural language prompts and a few minutes of training, our method can adapt a generator across a multitude of domains characterized by diverse styles and shapes. Notably, many of these modifications would be difficult or outright impossible to reach with existing methods. We conduct an extensive set of experiments and comparisons across a wide range of domains. These demonstrate the effectiveness of our approach and show that our shifted models maintain the latent-space properties that make generative models appealing for downstream tasks.
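As a rough illustration of the "blind" training setup described above, the sketch below fine-tunes a hypothetical pre-trained generator toward a target domain specified only by a text prompt, scoring generated images against the prompt in CLIP space. The helper `load_pretrained_generator`, the example prompt, and the simple cosine-similarity objective are assumptions made for illustration; they are not necessarily the exact loss or interface used by the method.

```python
# Minimal sketch: text-only domain adaptation of a generator via CLIP guidance.
# Assumes the open-source `clip` package (https://github.com/openai/CLIP) and a
# hypothetical StyleGAN-like generator with a `z_dim` attribute.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model.eval()

# Target domain described purely in natural language -- no training images.
text_tokens = clip.tokenize(["a pencil sketch of a face"]).to(device)
with torch.no_grad():
    text_features = clip_model.encode_text(text_tokens)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

G = load_pretrained_generator().to(device)        # hypothetical helper
optimizer = torch.optim.Adam(G.parameters(), lr=2e-3)

for step in range(300):                           # "a few minutes of training"
    z = torch.randn(4, G.z_dim, device=device)    # sample latent codes
    images = G(z)                                 # generated images, roughly in [-1, 1]

    # Resize to CLIP's input resolution; proper CLIP normalization omitted here.
    images = F.interpolate(images, size=(224, 224), mode="bilinear")
    image_features = clip_model.encode_image(images)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

    # Pull generated images toward the text description in CLIP embedding space.
    loss = (1 - (image_features * text_features).sum(dim=-1)).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Only the generator's parameters are optimized; the CLIP model stays frozen and acts purely as a text-conditioned critic, which is what allows the adaptation to proceed without any images from the target domain.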