Masked image modeling (MIM), which requires the target model to recover the masked parts of an input image, has recently received much attention in self-supervised learning (SSL). Although MIM-based pre-training achieves new state-of-the-art performance when transferred to many downstream tasks, visualizations show that the learned representations are less separable, especially compared to those produced by contrastive learning pre-training. This motivates us to investigate whether the linear separability of MIM pre-trained representations can be further improved, thereby improving pre-training performance. Since MIM and contrastive learning tend to use different data augmentations and training strategies, combining these two pretext tasks is non-trivial. In this work, we propose a novel and flexible pre-training framework, named MimCo, which combines MIM and contrastive learning through two-stage pre-training. Specifically, MimCo takes a pre-trained contrastive learning model as the teacher model and is pre-trained with two types of learning targets: a patch-level and an image-level reconstruction loss. Extensive transfer experiments on downstream tasks demonstrate the superior performance of our MimCo pre-training framework. Taking ViT-S as an example, when using a pre-trained MoCov3-ViT-S as the teacher model, MimCo needs only 100 epochs of pre-training to reach 82.53% top-1 fine-tuning accuracy on ImageNet-1K, outperforming state-of-the-art self-supervised learning counterparts.
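Since the abstract only sketches the objective, the following is a minimal PyTorch sketch of how the two reconstruction targets described above could be combined against a frozen contrastive teacher. The ToyEncoder backbone, the zero-masking scheme, the specific loss forms (MSE on masked patch features, cosine distance on global features), and the weighting factor lam are all illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of a MimCo-style objective: a frozen contrastive
# teacher supplies patch-level and image-level targets, and the masked
# student is trained to reconstruct both.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyEncoder(nn.Module):
    """Stand-in for a ViT backbone: maps patch tokens to patch- and image-level features."""

    def __init__(self, patch_dim=768, embed_dim=384):
        super().__init__()
        self.proj = nn.Linear(patch_dim, embed_dim)
        self.blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=6, batch_first=True),
            num_layers=2,
        )

    def forward(self, patch_tokens):
        x = self.blocks(self.proj(patch_tokens))   # (B, N, D) patch-level features
        return x, x.mean(dim=1)                    # image-level feature via mean pooling


def mimco_step(student, teacher, patches, mask, lam=1.0):
    """One toy training step.

    patches: (B, N, patch_dim) flattened image patches
    mask:    (B, N) boolean, True where the student's input is masked
    """
    with torch.no_grad():                          # the teacher (e.g. MoCov3) stays frozen
        t_patch, t_img = teacher(patches)

    masked_input = patches.masked_fill(mask.unsqueeze(-1), 0.0)  # simple zero-masking
    s_patch, s_img = student(masked_input)

    # Patch-level reconstruction: match teacher features on masked positions only.
    patch_loss = F.mse_loss(s_patch[mask], t_patch[mask])

    # Image-level reconstruction: align global features via cosine distance.
    img_loss = 1.0 - F.cosine_similarity(s_img, t_img, dim=-1).mean()

    return patch_loss + lam * img_loss


if __name__ == "__main__":
    student, teacher = ToyEncoder(), ToyEncoder()
    teacher.eval()
    patches = torch.randn(2, 196, 768)             # 2 images, 14x14 patches each
    mask = torch.rand(2, 196) < 0.6                # ~60% of patches masked
    loss = mimco_step(student, teacher, patches, mask)
    loss.backward()
    print(f"toy MimCo loss: {loss.item():.4f}")
```

In this sketch, only the student receives gradients; keeping the teacher under torch.no_grad() reflects the two-stage setup in which the contrastive model is pre-trained first and then reused as a fixed target provider.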