We develop an approach to text-to-image generation that incorporates additional retrieved images, driven by a combination of an implicit visual guidance loss and generative objectives. Unlike most existing text-to-image generation methods, which take only text as input, our method dynamically feeds cross-modal search results into a unified training stage, thereby improving the quality, controllability, and diversity of the generated results. We propose a novel hypernetwork-modulated visual-text encoding scheme that predicts the weight update of the encoding layer, enabling effective transfer of visual information (e.g., layout, content) into the corresponding latent domain. Experimental results show that our model, guided by additional retrieved visual data, outperforms existing GAN-based models. On the COCO dataset, we achieve a better FID of $9.13$ with up to $3.5 \times$ fewer generator parameters than the state-of-the-art method.
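As a rough illustration of the hypernetwork-modulated encoding idea described above (not the authors' implementation), the following PyTorch sketch shows a hypernetwork that maps pooled features of retrieved images to a low-rank weight update applied to a text-encoding layer. The module and parameter names (`HyperModulatedEncoder`, `rank`, the low-rank factorization itself) are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch, assuming a low-rank hypernetwork update over a linear text-encoding layer.
import torch
import torch.nn as nn

class HyperModulatedEncoder(nn.Module):
    def __init__(self, text_dim=512, visual_dim=512, rank=8):
        super().__init__()
        self.base = nn.Linear(text_dim, text_dim)   # shared text-encoding layer
        # Hypernetwork predicts a per-sample low-rank update dW = A @ B from visual features.
        self.to_A = nn.Linear(visual_dim, text_dim * rank)
        self.to_B = nn.Linear(visual_dim, rank * text_dim)
        self.rank = rank
        self.text_dim = text_dim

    def forward(self, text_feat, visual_feat):
        # text_feat:   (batch, text_dim)   latent text embedding
        # visual_feat: (batch, visual_dim) pooled features of retrieved images
        A = self.to_A(visual_feat).view(-1, self.text_dim, self.rank)
        B = self.to_B(visual_feat).view(-1, self.rank, self.text_dim)
        dW = torch.bmm(A, B)                         # (batch, text_dim, text_dim)
        base_out = self.base(text_feat)
        mod_out = torch.bmm(dW, text_feat.unsqueeze(-1)).squeeze(-1)
        return base_out + mod_out                    # encoding modulated by retrieved visuals

# Usage with random tensors
enc = HyperModulatedEncoder()
out = enc(torch.randn(4, 512), torch.randn(4, 512))  # (4, 512)
```

The low-rank factorization keeps the predicted weight update cheap relative to generating a full weight matrix per sample; the actual modulation scheme in the paper may differ.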