GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis

Synthesizing high-fidelity complex images from text is challenging. Building on large-scale pretraining, autoregressive and diffusion models can synthesize photo-realistic images. Although these large models have shown notable progress, three flaws remain. 1) These models require tremendous amounts of training data and parameters to achieve good performance. 2) The multi-step generation design heavily slows the image synthesis process. 3) The synthesized visual features are difficult to control and require delicately designed prompts. To enable high-quality, efficient, fast, and controllable text-to-image synthesis, we propose Generative Adversarial CLIPs, namely GALIP. GALIP leverages the powerful pretrained CLIP model in both the discriminator and the generator. Specifically, we propose a CLIP-based discriminator. The complex scene understanding ability of CLIP enables the discriminator to accurately assess image quality. Furthermore, we propose a CLIP-empowered generator that induces visual concepts from CLIP through bridge features and prompts. The CLIP-integrated generator and discriminator boost training efficiency; as a result, our model requires only about 3% of the training data and 6% of the learnable parameters while achieving results comparable to large pretrained autoregressive and diffusion models. Moreover, our model achieves 120 times faster synthesis speed and inherits the smooth latent space of GANs. Extensive experimental results demonstrate the excellent performance of GALIP. Code is available at https://github.com/tobran/GALIP.
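
The smooth latent space mentioned above is usually explored by interpolating between two noise vectors while holding the text condition fixed. The following is a minimal sketch of that idea only, not the GALIP implementation: the names `lerp`, `interpolation_path`, and `latent_dim`, and the latent size of 100, are assumptions made for illustration, and no generator is included. In practice each vector on the path would be passed, together with a fixed text embedding, through the generator in a single forward pass.

```python
import random


def lerp(z_a, z_b, t):
    """Linearly interpolate between two latent vectors at fraction t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]


def interpolation_path(z_a, z_b, steps):
    """Return `steps` evenly spaced latent vectors from z_a to z_b, endpoints included."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [lerp(z_a, z_b, i / (steps - 1)) for i in range(steps)]


# Hypothetical latent size; the value actually used by GALIP is not stated in the abstract.
latent_dim = 100
rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Two noise vectors sampled from a standard normal distribution.
z_start = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
z_end = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]

# Five latent codes moving from z_start to z_end.
path = interpolation_path(z_start, z_end, steps=5)
```

If the latent space is smooth, images generated from consecutive entries of `path` should change gradually rather than abruptly.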
