Recently, vector-quantized autoregressive (VQ-AR) models have shown remarkable results in text-to-image synthesis by predicting discrete image tokens uniformly, from top left to bottom right, in the latent space. Although this simple generative process works surprisingly well, is it the best way to generate an image? Human creation, for instance, tends to proceed from an outline to fine details, whereas VQ-AR models do not consider the relative importance of each image component. In this paper, we present a progressive denoising model for high-fidelity text-to-image generation. The proposed method creates new image tokens from coarse to fine, in parallel, based on the existing context, and applies this procedure recursively until the image token sequence is complete. The resulting coarse-to-fine hierarchy makes the image generation process intuitive and interpretable. Extensive experiments demonstrate that the progressive model achieves significantly better FID scores than the previous VQ-AR method across a wide variety of categories and aspects. Moreover, the text-to-image generation time of traditional AR models increases linearly with the output image resolution and is therefore quite time-consuming even for normal-size images. In contrast, our approach achieves a better trade-off between generation quality and speed.
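The coarse-to-fine, parallel decoding loop described above can be illustrated with a minimal sketch. This is not the paper's implementation: the `propose` interface, the use of a per-position score as a stand-in for token importance, and the even split of positions across steps are all assumptions made for illustration, and `toy_propose` is a random placeholder for a trained model.

```python
import math
import random
from typing import Callable, List, Optional, Tuple

# Hypothetical interface (an assumption, not from the paper): given the
# partially filled token sequence, return one (token, score) proposal for
# every position. A higher score means "commit this position earlier".
Proposer = Callable[[List[Optional[int]]], List[Tuple[int, float]]]


def progressive_decode(
    num_tokens: int, num_steps: int, propose: Proposer
) -> Tuple[List[Optional[int]], int]:
    """Fill a token sequence in a few parallel steps instead of one token per step."""
    tokens: List[Optional[int]] = [None] * num_tokens
    steps_used = 0
    for step in range(num_steps):
        empty = [i for i, t in enumerate(tokens) if t is None]
        if not empty:
            break
        # Proposals are conditioned on the tokens committed so far.
        proposals = propose(tokens)
        # Spread the remaining positions evenly over the remaining steps,
        # so the final step always fills whatever is left.
        k = math.ceil(len(empty) / (num_steps - step))
        # Commit the k highest-scoring empty positions in parallel.
        chosen = sorted(empty, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in chosen:
            tokens[i] = proposals[i][0]
        steps_used += 1
    return tokens, steps_used


def toy_propose(
    tokens: List[Optional[int]], vocab_size: int = 8
) -> List[Tuple[int, float]]:
    """Random stand-in for a trained model; seeded by the number of filled slots."""
    rng = random.Random(sum(t is not None for t in tokens))
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]
```

Calling `progressive_decode(16, 4, toy_propose)` fills a 16-token sequence in 4 passes, whereas a raster-scan autoregressive decoder would need one pass per token. The sketch only shows where the speed advantage comes from; the quality side of the trade-off depends on the actual model and is not captured here.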