The language modality within the vision-language pretraining framework is innately discrete, endowing each word in the language vocabulary with a semantic meaning. In contrast, the visual modality is inherently continuous and high-dimensional, which potentially hinders both the alignment and the fusion between the vision and language modalities. We therefore propose to "discretize" the visual representation by jointly learning a codebook that imbues each visual token with a semantic meaning. We then use these discretized visual semantics as self-supervised ground truths for our Masked Image Modeling objective, a counterpart of Masked Language Modeling, which has proven successful for language models. To optimize the codebook, we extend the formulation of VQ-VAE, which provides a theoretical guarantee. Experiments validate the effectiveness of our approach across common vision-language benchmarks.
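For reference, a minimal sketch of the standard VQ-VAE formulation that the codebook learning builds on: the encoder output is quantized to its nearest codebook entry, and training combines a reconstruction term with codebook and commitment terms. The notation below ($z_e$, $z_q$, $e_j$, $\operatorname{sg}$, $\beta$) follows the original VQ-VAE paper rather than this work, and the extension yielding the theoretical guarantee is not reproduced here.

\[
z_q(x) = e_k, \qquad k = \arg\min_{j} \lVert z_e(x) - e_j \rVert_2 ,
\]
\[
\mathcal{L}_{\mathrm{VQ\text{-}VAE}}
  = -\log p\bigl(x \mid z_q(x)\bigr)
  + \lVert \operatorname{sg}[z_e(x)] - e \rVert_2^{2}
  + \beta \, \lVert z_e(x) - \operatorname{sg}[e] \rVert_2^{2} ,
\]

where $\{e_1, \dots, e_K\}$ is the learned codebook, $\operatorname{sg}[\cdot]$ denotes the stop-gradient operator, and $\beta$ weights the commitment term that keeps encoder outputs close to their assigned codebook entries.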