A text-to-image generation (T2I) model aims to generate photo-realistic images that are semantically consistent with the text descriptions. Built upon the recent advances in generative adversarial networks (GANs), existing T2I models have made great progress. However, a close inspection of their generated images reveals two major limitations: (1) the conditional batch normalization methods are applied to the whole image feature maps equally, ignoring the local semantics; (2) the text encoder is fixed during training, whereas it should be trained jointly with the image generator to learn better text representations for image generation. To address these limitations, we propose a novel framework, Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information. Concretely, we introduce a novel Semantic-Spatial Aware Convolution Network, which (1) learns a semantic-adaptive transformation conditioned on text to effectively fuse text features and image features, and (2) learns a mask map in a weakly-supervised way that depends on the current text-image fusion process in order to guide the transformation spatially. Experiments on the challenging COCO and CUB bird datasets demonstrate the advantage of our method over recent state-of-the-art approaches, regarding both visual fidelity and alignment with the input text description. Code is available at https://github.com/wtliao/text2image.
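To make the two ideas above concrete, below is a minimal PyTorch-style sketch of a text-conditioned affine transformation gated by a learned spatial mask, assuming a sentence embedding as the text condition and BatchNorm statistics without learned affine parameters. All names (e.g. SemanticSpatialAffine, mask_conv) are illustrative assumptions, not the authors' implementation; refer to the repository linked above for the actual code.

```python
import torch
import torch.nn as nn


class SemanticSpatialAffine(nn.Module):
    """Hypothetical sketch: predict per-channel scale/shift from the text
    embedding and apply them only where a learned spatial mask is active."""

    def __init__(self, text_dim: int, num_channels: int):
        super().__init__()
        # Linear layers predicting per-channel scale and shift from the sentence vector
        self.gamma_fc = nn.Linear(text_dim, num_channels)
        self.beta_fc = nn.Linear(text_dim, num_channels)
        # 1x1 conv predicting a spatial mask from the current image features
        # (learned without mask annotations, i.e. weakly supervised)
        self.mask_conv = nn.Conv2d(num_channels, 1, kernel_size=1)
        # Normalize features; the affine part comes from the text condition
        self.norm = nn.BatchNorm2d(num_channels, affine=False)

    def forward(self, feat: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) image features; text_emb: (B, text_dim) sentence vector
        normalized = self.norm(feat)
        gamma = self.gamma_fc(text_emb).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = self.beta_fc(text_emb).unsqueeze(-1).unsqueeze(-1)    # (B, C, 1, 1)
        mask = torch.sigmoid(self.mask_conv(feat))                   # (B, 1, H, W)
        # Text-conditioned transformation applied spatially, guided by the mask
        return normalized * (1 + mask * gamma) + mask * beta


# Usage sketch: fuse a 256-d sentence embedding into 64-channel feature maps
if __name__ == "__main__":
    block = SemanticSpatialAffine(text_dim=256, num_channels=64)
    feats = torch.randn(2, 64, 32, 32)
    sent = torch.randn(2, 256)
    out = block(feats, sent)
    print(out.shape)  # torch.Size([2, 64, 32, 32])
```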