Generating images from natural language instructions is an intriguing yet highly challenging task. We approach text-to-image generation by combining the power of the pretrained CLIP representation with an off-the-shelf image generator (a GAN), optimizing in the latent space of the GAN to find images that achieve the maximum CLIP score with the given input text. Compared to traditional methods that train text-to-image generative models from scratch, the CLIP+GAN approach is training-free, zero-shot, and easily customized with different generators. However, optimizing the CLIP score in the GAN latent space poses a highly challenging optimization problem, and off-the-shelf optimizers such as Adam fail to yield satisfying results. In this work, we propose the FuseDream pipeline, which improves the CLIP+GAN approach with three key techniques: 1) an AugCLIP score, which robustifies the CLIP objective by applying random augmentations to the image; 2) a novel initialization and over-parameterization strategy for optimization, which allows us to efficiently navigate the non-convex landscape in the GAN space; and 3) a composed generation technique which, by leveraging a novel bi-level optimization formulation, can compose multiple images to extend the GAN space and overcome data bias. When prompted with different input text, FuseDream can generate high-quality images with varying objects, backgrounds, and artistic styles, and even novel counterfactual concepts that do not appear in the training data of the GAN we use. Quantitatively, the images generated by FuseDream achieve top-level Inception Score and FID on the MS COCO dataset, without additional architecture design or training. Our code is publicly available at \url{https://github.com/gnobitab/FuseDream}.
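The idea of scoring a latent code by averaging a similarity over random augmentations, and of initializing the search from the best of many sampled latents, can be illustrated with a minimal toy sketch. Everything here is an assumption made for illustration: `toy_generator` stands in for a GAN, `cosine` against a fixed `TEXT_EMB` vector stands in for the CLIP score, additive noise stands in for image augmentations, and a gradient-free hill climb replaces the gradient-based optimization used in practice. It is not the FuseDream implementation.

```python
import math
import random

# Hypothetical "text embedding"; a real pipeline would obtain this from CLIP.
TEXT_EMB = [1.0, -1.0, 1.0, -1.0, 0.5, 0.5, -0.5, 0.0]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def toy_generator(z):
    """Stand-in for a GAN: a fixed nonlinear map from latent z to an 'image' vector."""
    n = len(z)
    return [math.tanh(z[i] + 0.5 * z[(i + 1) % n]) for i in range(n)]


def augment(image, rng, noise=0.1):
    """Stand-in augmentation: small Gaussian noise instead of crops or color jitter."""
    return [p + noise * rng.gauss(0, 1) for p in image]


def aug_score(z, text_emb, rng, n_aug=8):
    """Average the similarity over n_aug random augmentations of the generated image."""
    image = toy_generator(z)
    scores = [cosine(augment(image, rng), text_emb) for _ in range(n_aug)]
    return sum(scores) / n_aug


def optimize(text_emb, dim=8, n_init=50, steps=200, step_size=0.1, seed=0):
    """Pick the best of n_init random latents, then hill-climb on the augmented score."""
    rng = random.Random(seed)
    # Initialization: sample many latents and keep the highest-scoring one.
    candidates = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_init)]
    best = max(candidates, key=lambda z: aug_score(z, text_emb, rng))
    best_score = aug_score(best, text_emb, rng)
    init_score = best_score
    # Gradient-free refinement: accept a random perturbation only if it scores higher.
    for _ in range(steps):
        trial = [v + step_size * rng.gauss(0, 1) for v in best]
        score = aug_score(trial, text_emb, rng)
        if score > best_score:
            best, best_score = trial, score
    return best, init_score, best_score


z, init_score, final_score = optimize(TEXT_EMB)
print(f"augmented score: {init_score:.3f} -> {final_score:.3f}")
```

Because each score is a noisy estimate, the reported improvement is itself approximate; averaging over more augmentations (`n_aug`) reduces that noise at a proportional cost per evaluation.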