Text-guided image manipulation tasks have recently gained attention in the vision-and-language community. While most prior studies focused on single-turn manipulation, our goal in this paper is to address the more challenging multi-turn image manipulation (MTIM) task. Previous models for this task generate images iteratively, given a sequence of instructions and the previously generated image. However, this approach suffers from under-generation and low quality of the generated objects described in the instructions, which consequently degrades the overall performance. To overcome these problems, we present a novel architecture called a Visually Guided Language Attention GAN (LatteGAN). Here, we address the limitations of previous approaches by introducing a Visually Guided Language Attention (Latte) module, which extracts fine-grained text representations for the generator, and a Text-Conditioned U-Net discriminator architecture, which discriminates both the global and local representations of fake or real images. Extensive experiments on two distinct MTIM datasets, CoDraw and i-CLEVR, demonstrate the state-of-the-art performance of the proposed model.
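To make the two components named above concrete, the following is a minimal PyTorch sketch, not the authors' implementation. It assumes the Latte module can be approximated by standard multi-head cross-attention in which instruction token embeddings attend to flattened visual features of the previously generated image, and that the Text-Conditioned U-Net discriminator outputs both a global (image-level) score and a local (per-pixel) score map conditioned on a pooled text embedding. All class names, dimensions, and layer choices are hypothetical.

```python
import torch
import torch.nn as nn


class LatteModule(nn.Module):
    """Visually guided language attention: text tokens attend to image features."""

    def __init__(self, text_dim=256, visual_dim=256, num_heads=4):
        super().__init__()
        self.visual_proj = nn.Conv2d(visual_dim, text_dim, kernel_size=1)
        self.attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)

    def forward(self, text_tokens, visual_feat):
        # text_tokens: (B, L, text_dim); visual_feat: (B, visual_dim, H, W)
        v = self.visual_proj(visual_feat).flatten(2).transpose(1, 2)  # (B, H*W, text_dim)
        attended, _ = self.attn(query=text_tokens, key=v, value=v)
        # Fine-grained, visually grounded text representation for the generator.
        return attended + text_tokens


class TextConditionedUNetDiscriminator(nn.Module):
    """Returns a global real/fake score and a local per-pixel score map."""

    def __init__(self, in_ch=3, base_ch=64, text_dim=256):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base_ch, 4, 2, 1), nn.LeakyReLU(0.2))
        self.enc2 = nn.Sequential(nn.Conv2d(base_ch, base_ch * 2, 4, 2, 1), nn.LeakyReLU(0.2))
        self.global_head = nn.Linear(base_ch * 2 + text_dim, 1)
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(base_ch * 2, base_ch, 4, 2, 1), nn.LeakyReLU(0.2))
        self.local_head = nn.Conv2d(base_ch * 2, 1, 3, 1, 1)  # takes skip-concatenated features

    def forward(self, image, text_emb):
        # image: (B, 3, H, W); text_emb: (B, text_dim) pooled instruction embedding
        e1 = self.enc1(image)
        e2 = self.enc2(e1)
        pooled = e2.mean(dim=(2, 3))
        global_score = self.global_head(torch.cat([pooled, text_emb], dim=1))  # (B, 1)
        d1 = self.dec1(e2)
        local_score = self.local_head(torch.cat([d1, e1], dim=1))  # (B, 1, H/2, W/2)
        return global_score, local_score


if __name__ == "__main__":
    latte = LatteModule()
    disc = TextConditionedUNetDiscriminator()
    tokens = torch.randn(2, 12, 256)            # 12 instruction tokens per turn
    prev_img_feat = torch.randn(2, 256, 16, 16)  # features of the previously generated image
    fused = latte(tokens, prev_img_feat)         # (2, 12, 256)
    g, l = disc(torch.randn(2, 3, 64, 64), torch.randn(2, 256))
    print(fused.shape, g.shape, l.shape)
```

In this reading, the generator would consume the visually grounded token representations produced by the Latte module at each turn, while the discriminator's global score enforces image-level realism and its local score map penalizes under-generated or low-quality objects at specific spatial locations.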