Text-to-image synthesis refers to generating visually realistic and semantically consistent images from given textual descriptions. Previous approaches generate an initial low-resolution image and then refine it into a high-resolution one. Despite remarkable progress, these methods do not fully utilize the given texts and can generate text-mismatched images, especially when the text description is complex. We propose a novel Fine-grained text-image Fusion based Generative Adversarial Networks, dubbed FF-GAN, which consists of two modules: a Fine-grained text-image Fusion Block (FF-Block) and Global Semantic Refinement (GSR). The proposed FF-Block integrates an attention block and several convolution layers to effectively fuse fine-grained word-context features into the corresponding visual features, so that the text information is fully exploited to refine the initial image with more details. The GSR improves the global semantic consistency between linguistic and visual features during the refinement process. Extensive experiments on the CUB-200 and COCO datasets demonstrate the superiority of FF-GAN over other state-of-the-art approaches in generating images that are semantically consistent with the given texts. Code is available at https://github.com/haoranhfut/FF-GAN.
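The abstract describes the FF-Block as an attention block combined with convolution layers that fuse word-level text context into visual features. The sketch below is a minimal, illustrative realization of that idea (word-region attention followed by convolutional fusion); all layer choices, dimensions, and the residual connection are assumptions for clarity, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFBlockSketch(nn.Module):
    """Hypothetical sketch of fine-grained text-image fusion:
    word-level attention over image regions, then convolutional fusion."""
    def __init__(self, img_channels=64, word_dim=256):
        super().__init__()
        # project word embeddings into the image feature space
        self.word_proj = nn.Linear(word_dim, img_channels)
        # convolution layers that fuse the attended word context with visual features
        self.fuse = nn.Sequential(
            nn.Conv2d(img_channels * 2, img_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(img_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(img_channels, img_channels, kernel_size=3, padding=1),
        )

    def forward(self, img_feat, word_feat):
        # img_feat: (B, C, H, W) visual features; word_feat: (B, T, D) word embeddings
        B, C, H, W = img_feat.shape
        words = self.word_proj(word_feat)                 # (B, T, C)
        regions = img_feat.flatten(2)                     # (B, C, H*W)
        attn = torch.bmm(words, regions)                  # (B, T, H*W) word-region affinities
        attn = F.softmax(attn, dim=1)                     # per-region distribution over words
        context = torch.bmm(words.transpose(1, 2), attn)  # (B, C, H*W) word context per region
        context = context.view(B, C, H, W)
        # fuse the word context into the visual features (residual refinement)
        return img_feat + self.fuse(torch.cat([img_feat, context], dim=1))
```

A usage example: `FFBlockSketch()(torch.randn(2, 64, 32, 32), torch.randn(2, 18, 256))` returns refined visual features of the same spatial size, which a subsequent upsampling stage could turn into a higher-resolution image.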