Although two-stage Vector Quantized (VQ) generative models allow for synthesizing high-fidelity and high-resolution images, their quantization operator encodes similar patches within an image into the same index, resulting in repeated artifacts across similar adjacent regions when using existing decoder architectures. To address this issue, we propose incorporating spatially conditional normalization to modulate the quantized vectors, inserting spatially variant information into the embedded index maps and encouraging the decoder to generate more photorealistic images. Moreover, we use multichannel quantization to increase the recombination capability of the discrete codes without increasing the cost of the model or the codebook. Additionally, to generate discrete tokens at the second stage, we adopt a Masked Generative Image Transformer (MaskGIT) to learn an underlying prior distribution in the compressed latent space, which is much faster than the conventional autoregressive model. Experiments on two benchmark datasets demonstrate that our proposed modulated VQGAN greatly improves reconstructed image quality and provides high-fidelity image generation.
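To illustrate the core idea, here is a minimal numpy sketch of spatially conditional normalization applied to a quantized feature map. The function name `spatial_modulation` and the shapes are assumptions for illustration; in the actual model, `gamma` and `beta` would be produced by small convolutional layers from a spatial conditioning input, not passed in directly.

```python
import numpy as np

def spatial_modulation(z_q, gamma, beta, eps=1e-5):
    """Sketch of SPADE-style spatially conditional normalization.

    z_q:         quantized feature map, shape (C, H, W)
    gamma, beta: spatially varying scale and shift, shape (C, H, W);
                 in the real model these come from conv layers over
                 a spatial condition (hypothetical interface here).
    """
    # Normalize each channel over its spatial positions (instance norm).
    mean = z_q.mean(axis=(1, 2), keepdims=True)
    var = z_q.var(axis=(1, 2), keepdims=True)
    z_norm = (z_q - mean) / np.sqrt(var + eps)
    # Spatially variant modulation: two positions that were quantized to
    # the same codebook index now receive different gamma/beta, so the
    # decoder no longer sees identical inputs at those positions.
    return gamma * z_norm + beta
```

Because `gamma` and `beta` vary per spatial position, two locations carrying the same quantized vector are mapped to different modulated features, which is exactly what suppresses the repeated-patch artifact described above.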