Masked image modeling has demonstrated great potential for alleviating the label-hungry problem of training large-scale vision Transformers, achieving impressive performance on various downstream tasks. In this work, we propose a unified view of masked image modeling after revisiting existing methods. Under this unified view, we introduce a simple yet effective method, termed MaskDistill, which reconstructs normalized semantic features from teacher models at the masked positions, conditioned on corrupted input images. Experimental results on image classification and semantic segmentation show that MaskDistill achieves performance comparable or superior to state-of-the-art methods. When using a huge vision Transformer and pretraining for 300 epochs, MaskDistill obtains 88.3% fine-tuning top-1 accuracy on ImageNet-1k (224×224 resolution) and 58.8% mIoU for semantic segmentation on ADE20k (512×512 resolution). The code and pretrained models will be available at https://aka.ms/unimim.
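To make the objective concrete, the following is a minimal PyTorch-style sketch of masked feature distillation as described above: a student receives the corrupted image and regresses layer-normalized teacher features at the masked positions only. The module interfaces, the prediction head, and the use of a Smooth L1 loss are illustrative assumptions, not details stated in the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskDistillSketch(nn.Module):
    """Hypothetical sketch: reconstruct normalized teacher features at masked patches."""

    def __init__(self, student: nn.Module, teacher: nn.Module, dim: int):
        super().__init__()
        self.student = student            # ViT returning (B, N, dim) patch features from a masked image
        self.teacher = teacher.eval()     # frozen teacher producing (B, N, dim) features from the clean image
        for p in self.teacher.parameters():
            p.requires_grad = False
        self.head = nn.Linear(dim, dim)   # assumed lightweight prediction head on student features

    def forward(self, images: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # images: (B, C, H, W); mask: (B, N) boolean, True marks a masked patch
        with torch.no_grad():
            target = self.teacher(images)                      # teacher features on the clean image
            target = F.layer_norm(target, target.shape[-1:])   # per-patch feature normalization

        pred = self.head(self.student(images, mask))           # student conditioned on the corrupted input
        # Regression loss computed only at masked positions (Smooth L1 is an assumed choice)
        return F.smooth_l1_loss(pred[mask], target[mask])
```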