An ongoing trend in generative modelling research has been to push sample resolutions higher whilst simultaneously reducing computational requirements for training and sampling. We aim to push this trend further by combining techniques that each represent the current pinnacle of efficiency in their respective areas: vector-quantized GAN (VQ-GAN), a vector-quantization (VQ) model capable of high levels of lossy, but perceptually insignificant, compression; hourglass transformers, a highly scalable self-attention model; and step-unrolled denoising autoencoders (SUNDAE), a non-autoregressive (NAR) text generative model. Unexpectedly, our method highlights weaknesses in the original formulation of hourglass transformers when applied to multidimensional data. In light of this, we propose modifications to the resampling mechanism that are applicable to any task applying hierarchical transformers to multidimensional data. Additionally, we demonstrate the scalability of SUNDAE to long sequence lengths, four times longer than prior work. Our proposed framework scales to high resolutions ($1024 \times 1024$) and trains quickly (2-4 days). Crucially, the trained model produces diverse and realistic megapixel samples in approximately 2 seconds on a consumer-grade GPU (GTX 1080Ti). The framework is also flexible: it supports an arbitrary number of sampling steps, sample-wise self-stopping, self-correction, conditional generation, and, through its NAR formulation, arbitrary inpainting masks. We obtain FID scores of 10.56 on FFHQ256 (close to the original VQ-GAN in less than half the sampling steps) and 21.85 on FFHQ1024 in only 100 sampling steps.