One of the major challenges in training text-to-image generation models is the need for a large number of high-quality image-text pairs. While image samples are often easily accessible, the associated text descriptions typically require careful human captioning, which is particularly time-consuming and costly. In this paper, we propose the first work to train text-to-image generation models without any text data. Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model: the requirement for text conditioning is seamlessly alleviated by generating text features from image features. Extensive experiments are conducted to illustrate the effectiveness of the proposed method. We obtain state-of-the-art results on the standard text-to-image generation tasks. Importantly, the proposed language-free model outperforms most existing models trained with full image-text pairs. Furthermore, our method can be applied to fine-tuning pre-trained models, which saves both training time and cost. Our pre-trained model obtains competitive results in zero-shot text-to-image generation on the MS-COCO dataset, yet with only around 1% of the model size and training data size relative to the recently proposed large DALL-E model.
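To make the language-free conditioning idea concrete, the sketch below derives a pseudo text feature from a CLIP image feature so that no caption is needed at training time. It is a minimal illustration, not the paper's exact procedure: the Gaussian-perturbation scheme, the function name pseudo_text_feature, and the noise_scale parameter are assumptions introduced here for clarity; only the use of a pre-trained CLIP image encoder and its normalized embedding space follows from the text above.

```python
# Minimal sketch: condition a generator on a text-like feature produced
# from an image alone, using CLIP's joint image-text embedding space.
# The perturbation scheme below is an illustrative assumption.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # pre-trained CLIP

@torch.no_grad()
def pseudo_text_feature(image_path: str, noise_scale: float = 0.1) -> torch.Tensor:
    """Produce a text-like conditioning vector from an image, no caption required."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    img_feat = model.encode_image(image)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)  # project to unit sphere
    # Perturb the image feature to mimic where a paired text feature might
    # lie in CLIP's joint space (assumed scheme, for illustration only).
    noise = torch.randn_like(img_feat)
    feat = img_feat + noise_scale * noise / noise.norm(dim=-1, keepdim=True)
    return feat / feat.norm(dim=-1, keepdim=True)

# The returned vector stands in for the CLIP text embedding that would
# normally condition the text-to-image generator during training.
```

At inference time, the same generator can instead be conditioned on genuine CLIP text embeddings of user prompts, since image and text features share the aligned semantic space.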