Masked image modeling (MIM) has demonstrated impressive results in self-supervised representation learning by recovering corrupted image patches. However, most existing studies operate on low-level image pixels, which hinders the exploitation of high-level semantics for representation models. In this work, we propose to use a semantic-rich visual tokenizer as the reconstruction target for masked prediction, providing a systematic way to promote MIM from pixel-level to semantic-level. Specifically, we propose vector-quantized knowledge distillation to train the tokenizer, which discretizes a continuous semantic space to compact codes. We then pretrain vision Transformers by predicting the original visual tokens for the masked image patches. Furthermore, we introduce a patch aggregation strategy that associates discrete image patches to enhance global semantic representation. Experiments on image classification and semantic segmentation show that BEiT v2 outperforms all compared MIM methods. On ImageNet-1K (224×224 input size), the base-size BEiT v2 achieves 85.5% top-1 accuracy for fine-tuning and 80.1% top-1 accuracy for linear probing. The large-size BEiT v2 obtains 87.3% top-1 accuracy for ImageNet-1K (224×224 input size) fine-tuning, and 56.7% mIoU on ADE20K for semantic segmentation. The code and pretrained models are available at https://aka.ms/beitv2.
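To make the tokenizer objective concrete, the following is a minimal PyTorch sketch of the vector-quantized knowledge distillation idea described above: encoder features are quantized to their nearest codebook entries (the discrete visual tokens), and a decoder is trained so the quantized codes reproduce the patch features of a frozen semantic teacher. All names here (`VQKDTokenizerSketch`, `encoder`, `decoder`, `teacher`, `num_codes`, `code_dim`) and the exact loss weighting are illustrative assumptions, not the released BEiT v2 implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQKDTokenizerSketch(nn.Module):
    """Sketch of vector-quantized knowledge distillation (VQ-KD).

    An encoder maps image patches to continuous features; each feature is
    snapped to its nearest codebook entry, yielding a discrete visual token;
    a decoder is trained so the quantized codes reconstruct the patch
    features of a frozen teacher. Names and shapes are illustrative only.
    """

    def __init__(self, encoder, decoder, teacher, num_codes=8192, code_dim=32):
        super().__init__()
        self.encoder = encoder          # images -> (B, N, code_dim) patch features
        self.decoder = decoder          # quantized codes -> teacher-sized features
        self.teacher = teacher.eval()   # frozen semantic teacher (assumption: e.g. CLIP)
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, images):
        z = self.encoder(images)                          # (B, N, code_dim)
        # Nearest-neighbor lookup in the codebook under cosine similarity.
        z_n = F.normalize(z, dim=-1)
        c_n = F.normalize(self.codebook.weight, dim=-1)   # (num_codes, code_dim)
        token_ids = (z_n @ c_n.t()).argmax(dim=-1)        # discrete visual tokens
        z_q = self.codebook(token_ids)
        # Straight-through estimator: forward pass uses the code,
        # gradients flow back to the encoder output z.
        z_q = z + (z_q - z).detach()

        pred = self.decoder(z_q)                          # (B, N, teacher_dim)
        with torch.no_grad():
            target = self.teacher(images)                 # (B, N, teacher_dim)
        # Distillation objective: align decoder output with teacher features.
        distill = 1 - F.cosine_similarity(pred, target, dim=-1).mean()
        # Commitment-style term keeps encoder outputs near their assigned codes.
        commit = F.mse_loss(z, z_q.detach())
        return distill + commit, token_ids
```

In the second stage, as the abstract states, a vision Transformer is then pretrained to predict the `token_ids` of masked patches, so the masked-prediction target is semantic codes rather than raw pixels.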