The vector-quantized variational autoencoder (VQ-VAE) is a discrete autoencoder that compresses images into discrete tokens. It is difficult to train due to the discretization step. In this paper, we propose a simple yet effective technique, dubbed Gaussian Quant (GQ), that converts a Gaussian VAE satisfying a certain constraint into a VQ-VAE without further training. GQ draws random Gaussian noise as a codebook and quantizes each posterior mean to its closest codeword. Theoretically, we prove that when the logarithm of the codebook size exceeds the bits-back coding rate of the Gaussian VAE, a small quantization error is guaranteed. Practically, we propose a heuristic for training a Gaussian VAE that makes GQ effective, named the target divergence constraint (TDC). Empirically, we show that GQ outperforms previous VQ-VAEs, such as VQGAN, FSQ, LFQ, and BSQ, on both UNet and ViT architectures. Furthermore, TDC also improves upon previous Gaussian VAE discretization methods, such as TokenBridge. The source code is available at https://github.com/tongdaxu/VQ-VAE-from-Gaussian-VAE.
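The quantization step described above can be sketched in a few lines of NumPy: sample a fixed Gaussian codebook and snap each posterior mean vector to its nearest codeword. This is a minimal illustration only; the function name, shapes, and codebook size are our assumptions, not the paper's implementation.

```python
import numpy as np

def gaussian_quant(posterior_mean, codebook_size=4096, seed=0):
    # Sketch of Gaussian Quant (GQ): the codebook is pure random
    # Gaussian noise, shared between encoder and decoder via the seed.
    # posterior_mean: (num_tokens, dim) array of Gaussian VAE means.
    rng = np.random.default_rng(seed)
    dim = posterior_mean.shape[1]
    codebook = rng.standard_normal((codebook_size, dim))
    # Squared L2 distance from every mean vector to every codeword.
    d = ((posterior_mean[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    tokens = d.argmin(axis=1)            # discrete token indices
    return tokens, codebook[tokens]      # indices and quantized latents
```

Per the abstract's theory, the quantization error of this nearest-neighbor step stays small once log(codebook_size) exceeds the bits-back coding rate of the Gaussian VAE.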