Self-supervised learning for computer vision has achieved tremendous progress and improved many downstream vision tasks such as image classification, semantic segmentation, and object detection. Among these, generative self-supervised vision learning approaches such as MAE and BEiT show promising performance. However, their global masked reconstruction mechanism is computationally demanding. To address this issue, we propose local masked reconstruction (LoMaR), a simple yet effective approach that performs masked reconstruction within a small window of 7$\times$7 patches on a simple Transformer encoder, improving the trade-off between efficiency and accuracy compared to global masked reconstruction over the entire image. Extensive experiments show that LoMaR reaches 84.1% top-1 accuracy on ImageNet-1K classification, outperforming MAE by 0.5%. After finetuning on 384$\times$384 images, the pretrained LoMaR reaches 85.4% top-1 accuracy, surpassing MAE by 0.6%. On MS COCO, LoMaR outperforms MAE by 0.5 $\text{AP}^\text{box}$ on object detection and 0.5 $\text{AP}^\text{mask}$ on instance segmentation. LoMaR is especially computation-efficient when pretraining on high-resolution images, e.g., it is 3.1$\times$ faster than MAE with 0.2% higher classification accuracy when pretraining on 448$\times$448 images. This local masked reconstruction learning mechanism can be easily integrated into any other generative self-supervised learning approach. Our code will be publicly available.
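To make the local masking idea concrete, the following is a minimal sketch (not the authors' implementation) of restricting MAE-style random masking to a sampled 7$\times$7 window of patches, assuming a 14$\times$14 patch grid (a 224$\times$224 image with 16$\times$16 patches) and a 0.75 mask ratio; the function name and parameters are hypothetical.

```python
import numpy as np

def sample_local_window(grid_size=14, window=7, mask_ratio=0.75, rng=None):
    """Hypothetical helper: sample one window x window region from a
    grid_size x grid_size patch grid, then split its patches into
    visible and masked sets. Reconstruction happens only inside the window."""
    rng = rng or np.random.default_rng()
    # Top-left corner of the window, kept fully inside the grid.
    top = rng.integers(0, grid_size - window + 1)
    left = rng.integers(0, grid_size - window + 1)
    # Flattened patch indices covered by the window.
    rows = np.arange(top, top + window)
    cols = np.arange(left, left + window)
    idx = (rows[:, None] * grid_size + cols[None, :]).ravel()
    # MAE-style random masking, but restricted to the local window,
    # so the encoder and decoder only ever attend over 49 patches.
    perm = rng.permutation(idx)
    n_mask = int(mask_ratio * idx.size)
    masked, visible = perm[:n_mask], perm[n_mask:]
    return visible, masked

visible, masked = sample_local_window()
print(len(visible), len(masked))  # 13 visible, 36 masked out of 49 patches
```

Because attention cost grows quadratically with sequence length, operating on 49 local patches rather than all 196 (or far more at 448$\times$448 resolution) is what drives the efficiency gain claimed above.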