Generative models with discrete latent representations have recently demonstrated an impressive ability to learn complex high-dimensional data distributions. However, their performance relies on a long sequence of tokens per instance and a large number of codebook entries, resulting in long sampling times and considerable computation to fit the categorical posterior. To address these issues, we propose the Masked Vector Quantization (MVQ) framework, which increases the representational capacity of each code vector by learning mask configurations via a stochastic winner-takes-all training regime called Multiple Hypothesis Dropout (MH-Dropout). On ImageNet 64$\times$64, MVQ reduces FID in existing vector quantization architectures by up to $68\%$ at 2 tokens per instance and $57\%$ at 5 tokens per instance. These improvements widen as the number of codebook entries is reduced, and they allow for a $7\textit{--}45\times$ speed-up in token sampling during inference. As an additional benefit, we find that smaller latent spaces lead MVQ to identify transferable visual representations, multiple of which can be smoothly combined.
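To make the core mechanism concrete, the following is a minimal PyTorch sketch of masked quantization with a stochastic winner-takes-all selection. Everything here is an illustrative assumption rather than the paper's exact algorithm: the function name `mh_dropout_quantize`, the parameterization of mask configurations as a dense `(K, H, D)` tensor of `H` hypotheses per code vector, the dropout rate `p_drop`, and the straight-through update are all stand-ins.

```python
import torch

def mh_dropout_quantize(z, codebook, masks, p_drop=0.5, training=True):
    """Quantize encoder outputs against masked variants of each code vector.

    z:        (B, D) encoder outputs
    codebook: (K, D) code vectors
    masks:    (K, H, D) learned binary mask configurations (H hypotheses per code)
    """
    B, D = z.shape
    K, H, _ = masks.shape
    # Each mask configuration applied to its code vector yields one hypothesis,
    # so a single codebook entry can represent up to H distinct vectors.
    hyps = (codebook.unsqueeze(1) * masks).reshape(K * H, D)
    dist = torch.cdist(z, hyps)  # (B, K*H) pairwise L2 distances
    if training:
        # Stochastic winner-takes-all: randomly hide hypotheses so that
        # different masked variants win (and receive gradient) over time.
        keep = torch.rand(B, K * H, device=z.device) > p_drop
        keep[torch.arange(B, device=z.device),
             torch.randint(K * H, (B,), device=z.device)] = True  # keep >= 1
        dist = dist.masked_fill(~keep, float("inf"))
    idx = dist.argmin(dim=1)   # winning hypothesis per instance
    z_q = hyps[idx]            # (B, D) quantized representation
    # Straight-through estimator, as in standard VQ-VAE training.
    return z + (z_q - z).detach(), idx
```

Under this reading, a small codebook with many mask hypotheses trades categorical-posterior cost for mask capacity, which is consistent with the abstract's claim that the gains widen as codebook entries shrink.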