As a challenging task, text-to-image generation aims to generate photo-realistic and semantically consistent images according to the given text descriptions. Existing methods mainly extract the text information from only one sentence to represent an image, and this text representation strongly affects the quality of the generated image. However, directly utilizing the limited information in one sentence omits some key attribute descriptions, which are crucial factors for describing an image accurately. To alleviate this problem, we propose an effective text representation method that complements the sentence with attribute information. Firstly, we construct an attribute memory to jointly control the text-to-image generation together with the sentence input. Secondly, we explore two update mechanisms, sample-aware and sample-joint, to dynamically optimize a generalized attribute memory. Furthermore, we design an attribute-sentence-joint conditional generator learning scheme to align the feature embeddings among multiple representations, which facilitates the cross-modal network training. Experimental results show that the proposed method obtains substantial performance improvements on both the CUB (FID from 14.81 to 8.57) and COCO (FID from 21.42 to 12.39) datasets.
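To make the idea of complementing a sentence embedding with a learnable attribute memory concrete, the following is a minimal sketch, not the authors' implementation: the module name, dimensions, and the attention-based retrieval from the memory are illustrative assumptions under which a fused attribute-sentence condition could be passed to a generator.

```python
# Minimal sketch (illustrative assumptions, not the paper's implementation):
# a bank of learnable attribute embeddings is queried by the sentence embedding,
# and the retrieved attribute feature is fused with the sentence to condition
# the generator jointly.
import torch
import torch.nn as nn

class AttributeMemoryConditioner(nn.Module):
    def __init__(self, num_attributes=128, attr_dim=256, sent_dim=256):
        super().__init__()
        # Attribute memory: learnable embeddings optimized with the network.
        self.memory = nn.Parameter(torch.randn(num_attributes, attr_dim))
        self.query = nn.Linear(sent_dim, attr_dim)

    def forward(self, sent_emb):
        # sent_emb: (batch, sent_dim) sentence embedding from a text encoder.
        q = self.query(sent_emb)                            # (batch, attr_dim)
        attn = torch.softmax(q @ self.memory.t(), dim=-1)   # (batch, num_attributes)
        attr_emb = attn @ self.memory                       # (batch, attr_dim)
        # Joint condition: concatenate sentence and retrieved attribute features.
        return torch.cat([sent_emb, attr_emb], dim=-1)

# Usage: the fused condition replaces the sentence-only condition of the generator.
cond = AttributeMemoryConditioner()(torch.randn(4, 256))    # shape (4, 512)
```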