Semantic image synthesis enables control over otherwise unconditional image generation by providing guidance on what is to be generated. We conditionally synthesize the latent space of a vector-quantized model (VQ-model) pre-trained to autoencode images. Instead of training an autoregressive Transformer on separately learned conditioning latents and image latents, we find that jointly learning the conditioning and image latents significantly improves the Transformer's modeling capability. While our jointly trained VQ-model matches the reconstruction performance of a vanilla VQ-model for both semantic and image latents, tying the two modalities at the autoencoding stage proves to be an important ingredient for improving autoregressive modeling performance. We show that our model improves semantic image synthesis with autoregressive models on the popular semantic image datasets ADE20k, Cityscapes, and COCO-Stuff.
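The abstract rests on two mechanisms: quantizing continuous latents against a learned codebook, and concatenating semantic-map tokens with image tokens into a single sequence for the autoregressive Transformer. The sketch below is a hedged illustration of those two steps only, not the authors' implementation; the array shapes, the single shared codebook, and the `quantize` helper are assumptions made for clarity.

```python
import numpy as np

def quantize(latents, codebook):
    """Replace each continuous latent with its nearest codebook entry.

    latents:  (N, D) continuous encoder outputs.
    codebook: (K, D) learned discrete codes.
    Returns the quantized vectors and their integer indices; the indices
    form the discrete tokens an autoregressive Transformer can model.
    """
    # Pairwise squared Euclidean distances, shape (N, K).
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return codebook[idx], idx

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))      # K=8 codes of dimension D=4 (toy sizes)
sem_latents = rng.normal(size=(5, 4))   # latents of the semantic map
img_latents = rng.normal(size=(5, 4))   # latents of the image

_, sem_tokens = quantize(sem_latents, codebook)
_, img_tokens = quantize(img_latents, codebook)

# Joint modeling: the conditioning tokens and image tokens are tied into one
# sequence, so the Transformer is trained on both modalities together.
tokens = np.concatenate([sem_tokens, img_tokens])
```

The point of the joint sequence is that the Transformer's next-token prediction over `img_tokens` is conditioned on the preceding `sem_tokens`, which is what turns unconditional generation into semantically guided synthesis.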