The output of text-to-image synthesis systems should be coherent, clear, photo-realistic scenes with high semantic fidelity to their conditioned text descriptions. Our Cross-Modal Contrastive Generative Adversarial Network (XMC-GAN) addresses this challenge by maximizing the mutual information between image and text. It does this via multiple contrastive losses which capture inter-modality and intra-modality correspondences. XMC-GAN uses an attentional self-modulation generator, which enforces strong text-image correspondence, and a contrastive discriminator, which acts both as a critic and as a feature encoder for contrastive learning. The quality of XMC-GAN's output is a major step up from previous models, as we show on three challenging datasets. On MS-COCO, XMC-GAN not only improves state-of-the-art FID from 24.70 to 9.33, but, more importantly, human raters prefer XMC-GAN 77.3% of the time for image quality and 74.1% of the time for image-text alignment, compared to three other recent models. XMC-GAN also generalizes to the challenging Localized Narratives dataset (which has longer, more detailed descriptions), improving state-of-the-art FID from 48.70 to 14.12. Lastly, we train and evaluate XMC-GAN on the challenging Open Images dataset, establishing a strong benchmark FID score of 26.91.
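The contrastive losses referred to above are of the InfoNCE family, which lower-bounds the mutual information between paired modalities. As a minimal sketch (not XMC-GAN's exact implementation: function names, the temperature value, and the NumPy setting are illustrative assumptions), a symmetric image-text contrastive loss over a batch of embeddings can be written as:

```python
import numpy as np

def info_nce_loss(img_emb, txt_emb, temperature=0.1):
    """Symmetric InfoNCE-style contrastive loss between batches of image and
    text embeddings; matching pairs share the same row index."""
    # L2-normalize so the dot product becomes cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (batch, batch); matches on the diagonal
    n = len(logits)

    def cross_entropy(l):
        # Softmax cross-entropy with the diagonal as the target class.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the image->text and text->image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Minimizing this loss pulls each image embedding toward its paired caption embedding while pushing it away from the other captions in the batch, which is the mechanism by which the inter-modality correspondences are captured; the intra-modality variants apply the same form to, e.g., real-fake image pairs.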