Prior work on text-to-image synthesis typically concatenates the sentence embedding with the noise vector. However, the sentence embedding and the noise vector are two different factors that control different aspects of the generation, and simply concatenating them entangles the latent factors and encumbers the generative model. In this paper, we attempt to decompose these two factors and propose Factor Decomposed Generative Adversarial Networks~(FDGAN). To achieve this, we first generate images from the noise vector alone and then apply the sentence embedding in the normalization layers of both the generator and the discriminators. We also design an additive norm layer to align and fuse the text-image features. The experimental results show that decomposing the noise vector and the sentence embedding disentangles the latent factors in text-to-image synthesis and makes the generative model more efficient. Compared with the baseline, FDGAN achieves better performance while using fewer parameters.
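To make the conditioning mechanism concrete, the sketch below shows one way a sentence embedding can enter through a normalization layer with an additive shift, instead of being concatenated with the noise vector at the input. This is a minimal illustration under assumed details: the class name `AdditiveNorm`, the choice of `BatchNorm2d`, and the shift-only (no scale) design are illustrative assumptions, not the paper's exact layer.

```python
import torch
import torch.nn as nn

class AdditiveNorm(nn.Module):
    """Sketch of a text-conditioned normalization layer (assumed design).

    Features are normalized without learned affine parameters; a
    per-channel additive shift predicted from the sentence embedding
    is then applied, so text conditioning stays separate from the
    noise-driven feature pathway.
    """
    def __init__(self, num_channels: int, embed_dim: int):
        super().__init__()
        # Normalize without affine parameters; conditioning supplies the shift.
        self.norm = nn.BatchNorm2d(num_channels, affine=False)
        # Predict a per-channel shift from the sentence embedding.
        self.shift = nn.Linear(embed_dim, num_channels)

    def forward(self, x: torch.Tensor, sent_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) image features; sent_emb: (B, embed_dim)
        h = self.norm(x)
        beta = self.shift(sent_emb).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        return h + beta
```

In such a design, a generator block would produce `x` from the noise vector through convolutions and feed the sentence embedding only as `sent_emb`, keeping the two factors on separate pathways.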