Image BERT pre-training with masked image modeling (MIM) has become a popular practice for self-supervised representation learning. A seminal work, BEiT, casts MIM as a classification task over a visual vocabulary, tokenizing the continuous visual signals into discrete vision tokens with a pre-learned dVAE. Despite being a feasible solution, this improper discretization hinders further improvements in image pre-training. Since image discretization has no ground-truth answer, we argue that a masked patch should not be assigned a unique token id, even if a better tokenizer could be obtained. In this work, we introduce an improved BERT-style image pre-training method, namely mc-BEiT, which performs the MIM proxy task with eased and refined multi-choice training objectives. Specifically, the multi-choice supervision for the masked image patches is formed by the soft probability vectors over the discrete token ids, which are predicted by an off-the-shelf image tokenizer and further refined by high-level inter-patch perceptions, drawing on the observation that similar patches should share their choices. Extensive experiments on classification, segmentation, and detection tasks demonstrate the superiority of our method: e.g., the pre-trained ViT-B achieves 84.1% top-1 fine-tuning accuracy on ImageNet-1K classification, 49.2% AP^b and 44.0% AP^m for object detection and instance segmentation on COCO, and 50.8% mIoU on ADE20K semantic segmentation, outperforming competitive counterparts. The code will be available at https://github.com/lixiaotong97/mc-BEiT.
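The target construction described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `multi_choice_targets`, the temperature `tau`, and the blending weight `omega` are assumptions for illustration. It shows the two stages the abstract names, softening the tokenizer's predictions into probability vectors, then refining them by propagating predictions across similar patches.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_choice_targets(token_logits, patch_feats, tau=1.0, omega=0.5):
    """Hypothetical sketch of soft multi-choice MIM targets.

    token_logits: (N, V) per-patch logits from an off-the-shelf tokenizer
    patch_feats:  (N, D) high-level patch features
    tau, omega:   assumed temperature and blending weight
    Returns an (N, V) matrix of soft target distributions.
    """
    # Eased targets: soft probabilities over the visual vocabulary
    # instead of a single hard token id per masked patch.
    p = softmax(token_logits / tau, axis=-1)

    # Inter-patch affinity from cosine similarity: the observation is
    # that similar patches should share their choices.
    f = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    sim = softmax(f @ f.T / tau, axis=-1)  # (N, N), rows sum to 1

    # Refined targets: blend each patch's own prediction with the
    # similarity-weighted predictions of the other patches.
    return omega * p + (1.0 - omega) * sim @ p
```

Each row of the result remains a valid probability distribution, so it can serve directly as the soft label in a cross-entropy MIM objective.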