A text-to-image generation (T2I) model aims to generate photo-realistic images that are semantically consistent with the text descriptions. Built upon the recent advances in generative adversarial networks (GANs), existing T2I models have made great progress. However, a close inspection of their generated images reveals two major limitations: (1) the conditional batch normalization methods are applied to the whole image feature map equally, ignoring local semantics; (2) the text encoder is fixed during training, whereas it should be trained jointly with the image generator to learn better text representations for image generation. To address these limitations, we propose a novel framework, Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information. Concretely, we introduce a novel Semantic-Spatial Aware Convolution Network, which (1) learns semantic-adaptive transformations conditioned on the text to effectively fuse text features and image features, and (2) learns a mask map in a weakly-supervised way, depending on the current text-image fusion process, to guide the transformation spatially. Experiments on the challenging COCO and CUB bird datasets demonstrate the advantage of our method over recent state-of-the-art approaches, regarding both visual fidelity and alignment with the input text descriptions.
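To make the described fusion mechanism concrete, the sketch below shows one plausible form of a semantic-spatial aware fusion block in PyTorch: a text-conditioned, per-channel affine transform applied after normalization, gated by a spatial mask predicted from the current image features. This is an illustration under our own assumptions; the class name SSAFusionBlock, the layer shapes, and the sigmoid-normalized single-channel mask are not taken from the paper.

```python
# Minimal illustrative sketch (not the authors' code) of a semantic-spatial
# aware fusion block. Names and layer shapes are hypothetical.
import torch
import torch.nn as nn


class SSAFusionBlock(nn.Module):
    """Fuses a sentence embedding into image feature maps.

    (1) Text-conditioned per-channel scale/shift (semantic-adaptive transform).
    (2) A spatial mask, predicted from the current image features without
        mask supervision, gates where the transform is applied.
    """

    def __init__(self, channels: int, text_dim: int):
        super().__init__()
        self.norm = nn.BatchNorm2d(channels, affine=False)
        # Map the text embedding to per-channel modulation parameters.
        self.to_gamma = nn.Linear(text_dim, channels)
        self.to_beta = nn.Linear(text_dim, channels)
        # Predict a single-channel spatial mask from the image features.
        self.to_mask = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, img_feat: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, C, H, W); text_emb: (B, text_dim)
        normed = self.norm(img_feat)
        gamma = self.to_gamma(text_emb).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = self.to_beta(text_emb).unsqueeze(-1).unsqueeze(-1)    # (B, C, 1, 1)
        mask = self.to_mask(img_feat)                                # (B, 1, H, W)
        # Apply the text-conditioned affine transform only where the mask is active.
        modulated = gamma * normed + beta
        return mask * modulated + (1.0 - mask) * normed


if __name__ == "__main__":
    block = SSAFusionBlock(channels=64, text_dim=256)
    out = block(torch.randn(2, 64, 32, 32), torch.randn(2, 256))
    print(out.shape)  # torch.Size([2, 64, 32, 32])
```

Because the mask receives gradients only through the generator's image-level losses, it is learned in a weakly-supervised way, consistent with the abstract's description; the exact architecture of the mask predictor in the paper may differ.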