A layout-to-image (L2I) generation model aims to generate a complicated image containing multiple objects (things) against a natural background (stuff), conditioned on a given layout. Built upon the recent advances in generative adversarial networks (GANs), existing L2I models have made great progress. However, a close inspection of their generated images reveals two major limitations: (1) the object-to-object as well as object-to-stuff relations are often broken, and (2) each object's appearance is typically distorted, lacking the key defining characteristics associated with the object class. We argue that these are caused by the lack of context-aware feature encoding of objects and stuff in their generators, and of location-sensitive appearance representation in their discriminators. To address these limitations, two new modules are proposed in this work. First, a context-aware feature transformation module is introduced in the generator to ensure that the generated feature encoding of each object or stuff is aware of other co-existing objects/stuff in the scene. Second, instead of feeding location-insensitive image features to the discriminator, we use the Gram matrix computed from the feature maps of the generated object images to preserve location-sensitive information, resulting in much enhanced object appearance. Extensive experiments show that the proposed method achieves state-of-the-art performance on the COCO-Thing-Stuff and Visual Genome benchmarks.
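For illustration, below is a minimal PyTorch-style sketch of how a Gram matrix can be computed from object feature maps before being passed to a discriminator. The tensor shapes, the normalisation by the number of spatial positions, and the name `gram_matrix` are assumptions for this sketch, not the paper's exact implementation.

```python
import torch

def gram_matrix(features: torch.Tensor) -> torch.Tensor:
    """Compute per-sample Gram matrices from a batch of feature maps.

    features: tensor of shape (N, C, H, W), e.g. intermediate activations
    extracted from generated object crops (an assumed setup). The Gram
    matrix records inner products between channel responses aggregated
    over all spatial positions, giving a compact appearance statistic.
    """
    n, c, h, w = features.shape
    # Flatten the spatial dimensions: (N, C, H*W)
    f = features.view(n, c, h * w)
    # Channel-by-channel inner products, normalised by the number of positions
    gram = torch.bmm(f, f.transpose(1, 2)) / (h * w)
    return gram  # shape (N, C, C)
```

In such a setup, the resulting (N, C, C) statistics, rather than globally pooled image features, would be what the discriminator sees for each generated object.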