One of the major challenges in training text-to-image generation models is the need for a large number of high-quality image-text pairs. While image samples are often easily accessible, the associated text descriptions typically require careful human captioning, which is particularly time-consuming and costly. In this paper, we propose the first work to train text-to-image generation models without any text data. Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model: the requirement for text conditioning is seamlessly alleviated by generating text features from image features. Extensive experiments are conducted to illustrate the effectiveness of the proposed method. We obtain state-of-the-art results on the standard text-to-image generation tasks. Importantly, the proposed language-free model outperforms most existing models trained with full image-text pairs. Furthermore, our method can be applied to fine-tuning pre-trained models, which saves both training time and cost. Our pre-trained model obtains competitive results in zero-shot text-to-image generation on the MS-COCO dataset, yet with only around 1% of the model size and training data size relative to the recently proposed large DALL-E model.
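To make the language-free conditioning idea concrete, the sketch below derives a pseudo text feature from a CLIP image feature so that no caption is needed at training time. It is a minimal illustration, not the paper's exact procedure: the Gaussian-perturbation scheme, the function name pseudo_text_feature, and the noise_scale parameter are assumptions introduced here for clarity; only the use of a pre-trained CLIP image encoder and its normalized embedding space follows from the text above.

```python
# Minimal sketch: condition a generator on a text-like feature produced
# from an image alone, using CLIP's joint image-text embedding space.
# The perturbation scheme below is an illustrative assumption.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # pre-trained CLIP

@torch.no_grad()
def pseudo_text_feature(image_path: str, noise_scale: float = 0.1) -> torch.Tensor:
    """Produce a text-like conditioning vector from an image, no caption required."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    img_feat = model.encode_image(image)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)  # project to unit sphere
    # Perturb the image feature to mimic where a paired text feature might
    # lie in CLIP's joint space (assumed scheme, for illustration only).
    noise = torch.randn_like(img_feat)
    feat = img_feat + noise_scale * noise / noise.norm(dim=-1, keepdim=True)
    return feat / feat.norm(dim=-1, keepdim=True)

# The returned vector stands in for the CLIP text embedding that would
# normally condition the text-to-image generator during training.
```

At inference time, the same generator can instead be conditioned on genuine CLIP text embeddings of user prompts, since image and text features share the aligned semantic space.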