We propose Fast text2StyleGAN, a natural language interface that adapts pre-trained GANs for text-guided human face synthesis. Leveraging recent advances in Contrastive Language-Image Pre-training (CLIP), no text data is required during training. Fast text2StyleGAN is formulated as a conditional variational autoencoder (CVAE), which provides additional control and diversity over the generated images at test time. Our model requires no re-training or fine-tuning of the GAN or CLIP when encountering new text prompts. Unlike prior work, we do not rely on optimization at test time, making our method orders of magnitude faster. Empirically, on the FFHQ dataset, our method generates images from natural language descriptions with varying levels of detail faster and more accurately than prior approaches.
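The CVAE formulation can be illustrated with a minimal sketch. Because CLIP embeds images and text in a shared space, the model can be trained with CLIP image embeddings as the condition and queried at test time with CLIP text embeddings; sampling the CVAE latent then yields diverse StyleGAN latent codes for one prompt. The dimensions, the single-linear-layer "networks", and all function names below are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

D_CLIP = 512   # CLIP embedding size (assumed)
D_Z = 64       # CVAE latent size (hypothetical)
D_W = 512      # StyleGAN w-space code size (assumed)

# Random linear maps standing in for the trained encoder/decoder networks.
W_enc = rng.standard_normal((D_W + D_CLIP, 2 * D_Z)) * 0.01
W_dec = rng.standard_normal((D_Z + D_CLIP, D_W)) * 0.01

def encode(w, c):
    """Encoder q(z | w, c): returns mean and log-variance of the latent."""
    h = np.concatenate([w, c]) @ W_enc
    return h[:D_Z], h[D_Z:]

def reparameterize(mu, logvar):
    """Reparameterization trick: z = mu + sigma * eps."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z, c):
    """Decoder p(w | z, c): maps latent + condition to a StyleGAN w code."""
    return np.concatenate([z, c]) @ W_dec

# Training-time pass: condition on a CLIP *image* embedding of a real face.
w_real = rng.standard_normal(D_W)
c_img = rng.standard_normal(D_CLIP)
mu, logvar = encode(w_real, c_img)
w_recon = decode(reparameterize(mu, logvar), c_img)

# Test-time: swap in a CLIP *text* embedding of a prompt; re-sampling z
# gives multiple distinct w codes (image diversity) with no optimization.
c_text = rng.standard_normal(D_CLIP)
samples = [decode(rng.standard_normal(D_Z), c_text) for _ in range(3)]
```

Because generation is a single decoder forward pass per sample, this design avoids the per-prompt latent optimization used by earlier CLIP-guided methods, which is the source of the claimed speedup.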