Synthesizing high-quality realistic images from text descriptions is a challenging task. Almost all existing text-to-image Generative Adversarial Networks employ a stacked architecture as the backbone, utilize cross-modal attention mechanisms to fuse text and image features, and introduce extra networks to ensure text-image semantic consistency, all of which add complexity. In this work, we propose a much simpler but more effective text-to-image model. Corresponding to these three limitations, we propose: 1) a novel one-stage text-to-image backbone that synthesizes high-quality images directly with a single pair of generator and discriminator; 2) a novel fusion module, the deep text-image fusion block, which deepens the text-image fusion process in the generator; 3) a novel target-aware discriminator, composed of a matching-aware gradient penalty and a one-way output, which pushes the generator to synthesize more realistic and text-image semantically consistent images without introducing extra networks. Compared with existing text-to-image models, our proposed DF-GAN is simpler yet more efficient at synthesizing realistic, text-matching images and achieves better performance. Extensive experiments on the Caltech-UCSD Birds 200 and COCO datasets demonstrate the superiority of the proposed model over state-of-the-art models.
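To make the deep text-image fusion block concrete, here is a minimal PyTorch sketch of the underlying idea: stacking text-conditioned affine transformations (channel-wise scale and shift predicted from the sentence embedding) inside the generator, so fusion happens repeatedly and at every resolution. The class names (`AffineTransform`, `DFBlock`), layer sizes, and stacking depth are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class AffineTransform(nn.Module):
    """Predicts channel-wise scale and shift from the sentence embedding."""
    def __init__(self, text_dim: int, num_channels: int):
        super().__init__()
        self.gamma = nn.Linear(text_dim, num_channels)  # scale
        self.beta = nn.Linear(text_dim, num_channels)   # shift

    def forward(self, feat: torch.Tensor, sent: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) image features; sent: (B, text_dim) sentence vector
        gamma = self.gamma(sent).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = self.beta(sent).unsqueeze(-1).unsqueeze(-1)
        return gamma * feat + beta

class DFBlock(nn.Module):
    """Stacks several affine transformations to deepen text-image fusion."""
    def __init__(self, text_dim: int, channels: int, depth: int = 2):
        super().__init__()
        self.affines = nn.ModuleList(
            [AffineTransform(text_dim, channels) for _ in range(depth)]
        )
        self.act = nn.ReLU(inplace=True)
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, feat: torch.Tensor, sent: torch.Tensor) -> torch.Tensor:
        # Condition the visual features on the text several times before convolving.
        for affine in self.affines:
            feat = self.act(affine(feat, sent))
        return self.conv(feat)
```

Because the conditioning is a cheap per-channel affine rather than cross-modal attention, it can be applied at every generator stage without the quadratic cost that restricts attention to a few image scales.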
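The matching-aware gradient penalty can likewise be sketched briefly. The idea is to regularize the discriminator's gradients at the target data points, i.e., real images paired with their matching sentences, which smooths the loss surface around exactly the points the generator should converge to. The sketch below assumes PyTorch and a one-way-output discriminator `D(img, sent)` returning a single scalar logit per sample; the weight `k` and exponent `p` are illustrative defaults, not guaranteed to match the authors' settings.

```python
import torch

def matching_aware_gradient_penalty(D, real_imgs, sents,
                                    k: float = 2.0, p: float = 6.0):
    # Penalize gradient norms only on (real image, matching text) pairs.
    real_imgs = real_imgs.requires_grad_(True)
    sents = sents.requires_grad_(True)
    out = D(real_imgs, sents)  # (B,) or (B, 1) logits
    grads = torch.autograd.grad(
        outputs=out,
        inputs=(real_imgs, sents),
        grad_outputs=torch.ones_like(out),
        create_graph=True,  # keep the graph so the penalty is differentiable
    )
    # Flatten image and sentence gradients per sample and take a joint norm.
    grad_flat = torch.cat([g.reshape(g.size(0), -1) for g in grads], dim=1)
    return k * (grad_flat.norm(2, dim=1) ** p).mean()
```

In training, this penalty would be added to the discriminator's adversarial loss; since the single one-way output already sees both image and text, no extra semantic-consistency network is needed.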