Multi-Tailed, Multi-Headed, Spatial Dynamic Memory refined Text-to-Image Synthesis

Synthesizing high-quality, realistic images from text descriptions is a challenging task. Current methods synthesize images from text in a multi-stage manner, typically by first generating a rough initial image and then refining image details at subsequent stages. However, existing methods that follow this paradigm suffer from three important limitations. Firstly, they synthesize initial images without attempting to separate image attributes at the word level. As a result, the object attributes of initial images (which provide the basis for subsequent refinement) are inherently entangled and ambiguous. Secondly, by using common text representations for all regions, current methods prevent text from being interpreted in fundamentally different ways at different parts of an image. Different image regions are therefore only allowed to assimilate the same type of information from text at each refinement stage. Finally, current methods generate refinement features only once at each refinement stage and attempt to address all image aspects in a single shot. This single-shot refinement limits the precision with which each refinement stage can learn to improve the prior image. Our proposed method introduces three novel components to address these shortcomings: (1) An initial generation stage that explicitly generates separate sets of image features for each word n-gram. (2) A spatial dynamic memory module for refinement of images. (3) An iterative multi-headed mechanism to make it easier to improve upon multiple image aspects. Experimental results demonstrate that our Multi-Headed Spatial Dynamic Memory image refinement with our Multi-Tailed Word-level Initial Generation (MSMT-GAN) performs favourably against the previous state of the art on the CUB and COCO datasets.
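To make the second and third ideas concrete, the following is a toy sketch, not the authors' architecture. It assumes plain dot-product attention so that each image region reads its own summary of the word features, and it applies several refinement passes in sequence rather than a single shot. The function names (`read_text_per_region`, `refine`), the `step` size, and the residual update rule are all hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def read_text_per_region(region_feats, word_feats):
    """Return one text summary per region.

    Each region computes its own attention weights over the words,
    so different regions can take different information from the text.
    (Assumption: dot-product attention; the paper's memory module may differ.)
    """
    dim = len(word_feats[0])
    summaries = []
    for region in region_feats:
        weights = softmax([dot(region, word) for word in word_feats])
        summary = [
            sum(wt * word[d] for wt, word in zip(weights, word_feats))
            for d in range(dim)
        ]
        summaries.append(summary)
    return summaries


def refine(region_feats, word_feats, num_heads=2, step=0.5):
    """Apply `num_heads` refinement passes in sequence.

    Each pass re-reads the text from the current features and adds a
    residual update, instead of refining everything in one shot.
    (Assumption: a simple additive update with a fixed step size.)
    """
    feats = [list(region) for region in region_feats]  # do not mutate the input
    for _ in range(num_heads):
        text = read_text_per_region(feats, word_feats)
        feats = [
            [f + step * t for f, t in zip(region, summary)]
            for region, summary in zip(feats, text)
        ]
    return feats


if __name__ == "__main__":
    regions = [[4.0, 0.0], [0.0, 4.0]]  # two image regions, 2-d features
    words = [[1.0, 0.0], [0.0, 1.0]]    # two word features
    print(read_text_per_region(regions, words))
    print(refine(regions, words, num_heads=2))
```

In this toy setup the first region attends mostly to the first word and the second region to the second, which is the kind of region-specific reading of text that the abstract contrasts with a single shared text representation.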
