Generative transformers have experienced rapid popularity growth in the computer vision community in synthesizing high-fidelity and high-resolution images. The best generative transformer models so far, however, still treat an image naively as a sequence of tokens, and decode an image sequentially following the raster scan ordering (i.e. line-by-line). We find this strategy neither optimal nor efficient. This paper proposes a novel image synthesis paradigm using a bidirectional transformer decoder, which we term MaskGIT. During training, MaskGIT learns to predict randomly masked tokens by attending to tokens in all directions. At inference time, the model begins with generating all tokens of an image simultaneously, and then refines the image iteratively conditioned on the previous generation. Our experiments demonstrate that MaskGIT significantly outperforms the state-of-the-art transformer model on the ImageNet dataset, and accelerates autoregressive decoding by up to 64x. Besides, we illustrate that MaskGIT can be easily extended to various image editing tasks, such as inpainting, extrapolation, and image manipulation.
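The iterative, confidence-driven decoding described above can be illustrated with a minimal sketch. This is not the authors' released implementation: the `model` callable, `MASK_ID`, `NUM_TOKENS`, and the cosine re-masking schedule are assumptions made for the example, which only shows the general idea of starting from a fully masked token grid and re-masking the least confident predictions at each refinement step.

```python
import math
import torch

MASK_ID = 8192          # hypothetical id of the special [MASK] token
NUM_TOKENS = 16 * 16    # hypothetical latent grid size (256 tokens per image)

def maskgit_decode(model, steps=8, temperature=1.0, device="cpu"):
    """Sketch of iterative parallel decoding: start fully masked, then at each
    step keep the most confident predictions and re-mask the rest."""
    tokens = torch.full((1, NUM_TOKENS), MASK_ID, dtype=torch.long, device=device)

    for t in range(steps):
        # Bidirectional transformer predicts logits for every position at once.
        logits = model(tokens)                      # (1, NUM_TOKENS, vocab_size)
        probs = torch.softmax(logits / temperature, dim=-1).squeeze(0)
        sampled = torch.multinomial(probs, num_samples=1).squeeze(-1)
        confidence = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)

        # Positions decoded in earlier steps keep their tokens and full confidence.
        is_masked = tokens.squeeze(0) == MASK_ID
        sampled = torch.where(is_masked, sampled, tokens.squeeze(0))
        confidence = torch.where(is_masked, confidence, torch.ones_like(confidence))

        # Assumed cosine schedule: the masked fraction shrinks toward zero.
        mask_ratio = math.cos(math.pi / 2 * (t + 1) / steps)
        num_to_mask = int(NUM_TOKENS * mask_ratio)
        if num_to_mask == 0:
            tokens = sampled.unsqueeze(0)
            break

        # Re-mask the least confident positions for the next refinement pass.
        remask_idx = torch.topk(confidence, num_to_mask, largest=False).indices
        sampled[remask_idx] = MASK_ID
        tokens = sampled.unsqueeze(0)

    return tokens
```

Because every position is predicted in parallel at each step, the number of forward passes is the (small) number of refinement steps rather than the number of tokens, which is where the claimed speedup over raster-scan autoregressive decoding comes from.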