This is an exploratory study showing that current image quantization (vector quantization) methods do not satisfy translation equivariance in the quantized space due to aliasing. Instead of focusing on anti-aliasing, we propose a simple yet effective way to achieve translation-equivariant image quantization by enforcing orthogonality among the codebook embeddings. To explore the advantages of translation-equivariant image quantization, we conduct three proof-of-concept experiments on a carefully controlled dataset: (1) text-to-image generation, where the quantized image indices are the target to predict, (2) image-to-text generation, where the quantized image indices are given as a condition, and (3) training on a smaller subset to analyze sample efficiency. From these strictly controlled experiments, we empirically verify that the translation-equivariant image quantizer improves not only sample efficiency but also accuracy over VQGAN, by up to +11.9% in text-to-image generation and +3.9% in image-to-text generation.
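To make the core idea concrete, the following is a minimal sketch of a soft orthogonality penalty on the codebook embeddings, written in PyTorch. It is an illustrative assumption rather than the paper's exact formulation: the codebook size `K`, embedding dimension `d`, and weighting factor `lambda_reg` are hypothetical, and the penalty simply drives the Gram matrix of the normalized code vectors toward the identity.

```python
import torch
import torch.nn.functional as F

def codebook_orthogonality_loss(codebook: torch.Tensor) -> torch.Tensor:
    """Soft orthogonality penalty on a (K, d) codebook.

    Pushes the Gram matrix of L2-normalized code embeddings toward the
    identity, so distinct codes become (near-)orthogonal.
    """
    normed = F.normalize(codebook, dim=1)          # (K, d), unit-norm rows
    gram = normed @ normed.t()                     # (K, K) pairwise cosine similarities
    identity = torch.eye(codebook.size(0), device=codebook.device)
    return ((gram - identity) ** 2).sum() / codebook.size(0) ** 2

# Hypothetical usage inside a VQ training loop:
codebook = torch.nn.Embedding(1024, 256).weight    # assumed K=1024, d=256
reg = codebook_orthogonality_loss(codebook)
# total_loss = vq_loss + lambda_reg * reg          # lambda_reg is a tunable weight
```

In this sketch, the penalty is added to the usual vector-quantization objective as an auxiliary regularizer; how strongly it is weighted relative to the reconstruction and commitment terms would need to be tuned per setup.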