Text-to-image (T2I) models based on diffusion processes have achieved remarkable success in controllable image generation from user-provided captions. However, the tight coupling between the text encoder and the image decoder in current T2I models makes replacing or upgrading the encoder challenging: such changes often require massive fine-tuning or even retraining from scratch at prohibitive cost. To address this problem, we propose GlueGen, which applies a newly proposed GlueNet model to align features from single-modal or multi-modal encoders with the latent space of an existing T2I model. The approach introduces a new training objective that leverages parallel corpora to align the representation spaces of different encoders. Empirical results show that GlueNet can be trained efficiently and enables capabilities beyond previous state-of-the-art models: 1) multilingual language models such as XLM-Roberta can be aligned with existing T2I models, allowing high-quality image generation from captions in languages other than English; 2) GlueNet can align multi-modal encoders such as AudioCLIP with the Stable Diffusion model, enabling sound-to-image generation; 3) it can also upgrade the current text encoder of a latent diffusion model, improving generation on challenging cases. By aligning diverse feature representations, GlueNet allows flexible and efficient integration of new functionality into existing T2I models and sheds light on X-to-image (X2I) generation.
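The abstract describes the alignment mechanism only at a high level. As a concrete illustration, the sketch below shows the general shape of such parallel-corpus alignment training: a small translator network maps features from a new encoder into the space expected by the frozen T2I text encoder. The translator architecture, the feature dimensions (768, sequence length 77), and the plain MSE objective are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

# Minimal sketch of a GlueNet-style translator. It maps token features from
# a new encoder (e.g., XLM-Roberta or AudioCLIP) into the space expected by
# the frozen text encoder of an existing T2I model (e.g., CLIP in Stable
# Diffusion). Dimensions and the MSE loss are illustrative assumptions.

class GlueNetTranslator(nn.Module):
    def __init__(self, src_dim: int = 768, tgt_dim: int = 768, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(src_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, tgt_dim),
        )

    def forward(self, src_features: torch.Tensor) -> torch.Tensor:
        # src_features: (batch, seq_len, src_dim) features from the new encoder
        return self.net(src_features)

def alignment_loss(translator: nn.Module,
                   src_features: torch.Tensor,
                   tgt_features: torch.Tensor) -> torch.Tensor:
    # Parallel-corpus supervision: the same caption is encoded by both the
    # new encoder (src) and the original T2I text encoder (tgt); the
    # translator learns to reproduce the target features.
    pred = translator(src_features)
    return nn.functional.mse_loss(pred, tgt_features)

# Usage sketch: both encoders stay frozen; only the translator is trained.
translator = GlueNetTranslator()
src = torch.randn(4, 77, 768)  # stand-in for new-encoder features of 4 captions
tgt = torch.randn(4, 77, 768)  # stand-in for CLIP features of the same captions
loss = alignment_loss(translator, src, tgt)
loss.backward()
```

In practice the translator would be trained on real encoder outputs for paired captions from a parallel corpus; the random tensors above merely stand in for those features.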