Text-to-image synthesis aims to generate natural images conditioned on text descriptions. The main difficulty of this task lies in effectively fusing text information into the image synthesis process. Existing methods usually fuse suitable text information into the synthesis process through multiple isolated fusion blocks (e.g., Conditional Batch Normalization and Instance Normalization). However, isolated fusion blocks not only conflict with each other but also increase the difficulty of training (see the first page of the supplementary material). To address these issues, we propose a Recurrent Affine Transformation (RAT) for Generative Adversarial Networks that connects all the fusion blocks with a recurrent neural network to model their long-term dependency. In addition, to improve semantic consistency between texts and synthesized images, we incorporate a spatial attention model into the discriminator. Being aware of matching image regions, the text descriptions supervise the generator to synthesize more relevant image content. Extensive experiments on the CUB, Oxford-102 and COCO datasets demonstrate the superiority of the proposed model in comparison to state-of-the-art models.\footnote{https://github.com/senmaoy/Recurrent-Affine-Transformation-for-Text-to-image-Synthesis.git}
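To make the core idea concrete, the following is a minimal PyTorch sketch of how a recurrent affine transformation could connect the fusion blocks. The names (RATBlock, RATGenerator), the choice of a GRU cell, and all layer sizes are illustrative assumptions, not the authors' exact implementation; the abstract only states that a recurrent neural network links the fusion blocks that condition image features on the text.

\begin{verbatim}
import torch
import torch.nn as nn

class RATBlock(nn.Module):
    """One fusion block: channel-wise affine transformation whose
    scale/shift parameters are predicted from a recurrent state.
    (Sketch; sizes and structure are assumptions.)"""

    def __init__(self, num_channels: int, hidden_dim: int):
        super().__init__()
        self.to_gamma = nn.Linear(hidden_dim, num_channels)
        self.to_beta = nn.Linear(hidden_dim, num_channels)

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map; h: (B, hidden_dim) recurrent state.
        gamma = self.to_gamma(h).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = self.to_beta(h).unsqueeze(-1).unsqueeze(-1)
        return gamma * x + beta

class RATGenerator(nn.Module):
    """Ties all RAT blocks together with a shared GRU cell (an assumed
    instantiation of the RNN mentioned in the abstract)."""

    def __init__(self, text_dim: int, hidden_dim: int, channels: list):
        super().__init__()
        self.rnn_cell = nn.GRUCell(text_dim, hidden_dim)
        self.blocks = nn.ModuleList(RATBlock(c, hidden_dim) for c in channels)

    def fuse(self, feats: list, text: torch.Tensor) -> list:
        # feats: per-stage feature maps; text: (B, text_dim) sentence embedding.
        h = text.new_zeros(text.size(0), self.rnn_cell.hidden_size)
        out = []
        for feat, block in zip(feats, self.blocks):
            # The same text embedding drives the RNN at every stage, so the
            # evolving hidden state models dependencies across fusion blocks
            # instead of leaving them isolated.
            h = self.rnn_cell(text, h)
            out.append(block(feat, h))
        return out
\end{verbatim}

In this sketch the shared hidden state is what replaces independent per-block conditioning: each block's affine parameters depend on all earlier blocks through the recurrence, which is the long-term dependency the abstract refers to.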