In this paper, we focus on the semantic image synthesis task, which aims at translating semantic label maps to photo-realistic images. Existing methods lack effective semantic constraints to preserve semantic information and ignore the structural correlations in both the spatial and channel dimensions, leading to blurry, artifact-prone results. To address these limitations, we propose a novel Dual Attention GAN (DAGAN) that synthesizes photo-realistic and semantically consistent images with fine details from the input layouts, without imposing extra training overhead or modifying the network architectures of existing methods. We also propose two novel modules, i.e., a position-wise Spatial Attention Module (SAM) and a scale-wise Channel Attention Module (CAM), to capture semantic structure attention in the spatial and channel dimensions, respectively. Specifically, SAM selectively correlates the pixels at each position through a spatial attention map, so that pixels sharing the same semantic label become related to each other regardless of their spatial distance. Meanwhile, CAM selectively emphasizes the scale-wise features at each channel through a channel attention map, which integrates associated features among all channel maps regardless of their scales. Finally, we sum the outputs of SAM and CAM to further improve the feature representation. Extensive experiments on four challenging datasets show that DAGAN achieves remarkably better results than state-of-the-art methods while using fewer model parameters. The source code and trained models are available at https://github.com/Ha0Tang/DAGAN.
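The abstract describes SAM as relating every pixel to every other pixel via a spatial attention map, CAM as relating channel maps via a channel attention map, and a final sum of the two outputs. The following is a minimal, illustrative NumPy sketch of that dual-attention idea only; it omits the learned projection convolutions, learnable fusion scales, and all GAN machinery of the actual DAGAN implementation, and the function names are our own.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(x):
    """Position-wise attention: each pixel aggregates features from all
    positions, weighted by pairwise pixel affinities (a sketch of SAM)."""
    c, h, w = x.shape
    flat = x.reshape(c, h * w)               # (C, N) with N = H*W
    attn = softmax(flat.T @ flat, axis=-1)   # (N, N) pixel-to-pixel affinities
    out = flat @ attn.T                      # aggregate over all positions
    return x + out.reshape(c, h, w)          # residual connection

def channel_attention(x):
    """Scale-wise attention: each channel map aggregates features from all
    channels, weighted by channel affinities (a sketch of CAM)."""
    c, h, w = x.shape
    flat = x.reshape(c, h * w)
    attn = softmax(flat @ flat.T, axis=-1)   # (C, C) channel affinities
    out = attn @ flat                        # integrate across channel maps
    return x + out.reshape(c, h, w)

def dual_attention(x):
    # As in the abstract, the two attention outputs are summed.
    return spatial_attention(x) + channel_attention(x)

feats = np.random.rand(8, 4, 4).astype(np.float32)  # toy (C, H, W) feature map
fused = dual_attention(feats)
print(fused.shape)  # shape is preserved: (8, 4, 4)
```

Note that because the spatial attention map is computed over all pixel pairs, two pixels with the same semantic label can reinforce each other no matter how far apart they are, which is the property the abstract highlights.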