Masked image modeling (MIM) has shown great potential for self-supervised learning over the past year. Building on the universal vision transformer backbone, MIM learns self-supervised visual representations by masking a portion of image patches and attempting to recover the missing pixels. Most previous works mask image patches randomly, which underutilizes semantic information that is beneficial to visual representation learning. On the other hand, due to the large size of the backbone, most previous works require long pre-training times. In this paper, we propose the \textbf{Attention-driven Masking and Throwing Strategy} (AMT), which addresses both problems. We first leverage the self-attention mechanism to obtain the semantic information of the image automatically during training, without any supervision. The masking strategy is then guided by this information to mask regions selectively, which benefits representation learning. Moreover, we propose a redundant-patch throwing strategy that makes learning more efficient. As a plug-and-play module for masked image modeling, AMT improves the linear probing accuracy of MAE by $2.9\% \sim 5.9\%$ on CIFAR-10/100, STL-10, Tiny ImageNet, and ImageNet-1K, and improves the fine-tuning accuracy of MAE and SimMIM. The design also achieves superior performance on downstream detection and segmentation tasks.
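To make the idea concrete, the following is a minimal, illustrative PyTorch sketch of attention-guided masking and patch throwing. It assumes per-patch attention scores are already available (e.g., averaged [CLS]-to-patch attention from a ViT block); the function name, the mask/throw ratios, and the choice to mask the most-attended patches are assumptions for illustration, not the exact AMT procedure.

\begin{verbatim}
# Illustrative sketch (not the paper's implementation) of attention-driven
# masking and throwing, assuming per-patch attention scores are given.
import torch


def attention_mask_and_throw(patches, attn_scores,
                             mask_ratio=0.5, throw_ratio=0.25):
    """patches:     (B, N, D) patch embeddings
       attn_scores: (B, N) per-patch attention scores (higher = more semantic)
       Returns visible patches and index tensors for masked/thrown patches."""
    B, N, D = patches.shape
    n_throw = int(N * throw_ratio)   # assumed ratio of redundant patches to discard
    n_mask = int(N * mask_ratio)     # assumed ratio of patches to mask

    # Sort patches by attention: the least-attended ones are treated as
    # redundant and thrown away to shorten the encoder's input sequence.
    order = attn_scores.argsort(dim=1)            # ascending
    throw_idx = order[:, :n_throw]
    remain_idx = order[:, n_throw:]

    # Among the remaining patches, mask the most-attended ones so the model
    # must reconstruct semantically rich regions; the rest stay visible.
    remain_scores = torch.gather(attn_scores, 1, remain_idx)
    remain_order = remain_scores.argsort(dim=1, descending=True)
    mask_idx = torch.gather(remain_idx, 1, remain_order[:, :n_mask])
    keep_idx = torch.gather(remain_idx, 1, remain_order[:, n_mask:])

    visible = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return visible, mask_idx, throw_idx


# Toy usage with random embeddings and scores (14x14 patches, ViT-Base width).
if __name__ == "__main__":
    x = torch.randn(2, 196, 768)
    scores = torch.rand(2, 196)
    vis, masked, thrown = attention_mask_and_throw(x, scores)
    print(vis.shape, masked.shape, thrown.shape)
\end{verbatim}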