DiverGAN: An Efficient and Effective Single-Stage Framework for Diverse Text-to-Image Generation

In this paper, we present DiverGAN, an efficient and effective single-stage framework for generating diverse, plausible, and semantically consistent images from a natural-language description. DiverGAN adopts two novel word-level attention modules, a channel-attention module (CAM) and a pixel-attention module (PAM), which model the importance of each word in the given sentence and allow the network to assign larger weights to the channels and pixels that are semantically aligned with the salient words. Conditional Adaptive Instance-Layer Normalization (CAdaILN) is then introduced so that linguistic cues from the sentence embedding can flexibly control the amount of change in shape and texture, further improving the visual-semantic representation and helping to stabilize training. A dual-residual structure is also developed to preserve more of the original visual features while allowing for deeper networks, resulting in faster convergence and more vivid details. Furthermore, we propose plugging a fully connected layer into the pipeline to address the lack-of-diversity problem: we observe that a dense layer remarkably enhances the generative capability of the network by balancing the trade-off between the low-dimensional random latent code, which contributes to variation, and the modulation modules, which use high-dimensional textual contexts to strengthen feature maps. Inserting a linear layer after the second residual block achieves the best variety and quality. Both qualitative and quantitative results on benchmark data sets demonstrate the superiority of DiverGAN in achieving diversity without harming quality or semantic consistency.
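The role of CAdaILN can be made concrete with a small sketch. The abstract does not give the formula, so the code below assumes the common adaptive instance-layer pattern: instance-normalized and layer-normalized features are blended by a ratio `rho`, and the result is scaled and shifted by values predicted from the sentence embedding. The function name `cadailn`, the linear maps `w_gamma` and `w_beta`, and the scalar `rho` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def cadailn(x, sent_emb, w_gamma, w_beta, rho, eps=1e-5):
    """Sketch of a conditional adaptive instance-layer normalization.

    x        : feature map of shape (C, H, W)
    sent_emb : sentence embedding of shape (D,)
    w_gamma  : hypothetical linear map (C, D) predicting per-channel scale
    w_beta   : hypothetical linear map (C, D) predicting per-channel shift
    rho      : blend ratio, clipped to [0, 1]; 1 = pure instance norm
    """
    # Instance normalization: statistics per channel over spatial positions.
    in_mean = x.mean(axis=(1, 2), keepdims=True)
    in_var = x.var(axis=(1, 2), keepdims=True)
    x_in = (x - in_mean) / np.sqrt(in_var + eps)

    # Layer normalization: statistics over all channels and positions.
    x_ln = (x - x.mean()) / np.sqrt(x.var() + eps)

    # Blend the two normalized views of the same feature map.
    rho = float(np.clip(rho, 0.0, 1.0))
    mixed = rho * x_in + (1.0 - rho) * x_ln

    # Text-conditioned affine parameters (assumed to be linear in the embedding).
    gamma = (w_gamma @ sent_emb)[:, None, None]
    beta = (w_beta @ sent_emb)[:, None, None]
    return gamma * mixed + beta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    features = rng.normal(size=(4, 8, 8))
    sentence = rng.normal(size=3)
    out = cadailn(features, sentence,
                  rng.normal(size=(4, 3)), rng.normal(size=(4, 3)), rho=0.5)
    print(out.shape)
```

In this sketch the sentence embedding only enters through `gamma` and `beta`, which is how the text can modulate shape and texture without changing the spatial layout of the feature map. In a trained model `rho` and the two linear maps would be learned parameters.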
