Masked image modeling (MIM), which requires the target model to recover the masked parts of the input image, has recently received much attention in self-supervised learning (SSL). Although MIM-based pre-training methods achieve new state-of-the-art performance when transferred to many downstream tasks, visualizations show that the learned representations are less separable, especially compared to those obtained from contrastive learning pre-training. This motivates us to investigate whether the linear separability of MIM pre-trained representations can be further improved, thereby improving pre-training performance. Since MIM and contrastive learning tend to use different data augmentations and training strategies, combining the two pretext tasks is not trivial. In this work, we propose a novel and flexible pre-training framework, named MimCo, which combines MIM and contrastive learning through two-stage pre-training. Specifically, MimCo takes a contrastively pre-trained model as the teacher and is pre-trained with two types of learning targets: a patch-level and an image-level reconstruction loss. Extensive transfer experiments on downstream tasks demonstrate the superior performance of our MimCo pre-training framework. Taking ViT-S as an example, with a pre-trained MoCov3-ViT-S as the teacher, MimCo needs only 100 epochs of pre-training to reach 82.53% top-1 finetuning accuracy on ImageNet-1K, outperforming state-of-the-art self-supervised learning counterparts.
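To make the two learning targets concrete, the following is a minimal sketch (not the authors' code) of how a frozen contrastive teacher could supervise a masked student with a patch-level and an image-level reconstruction loss. The module names, encoder interfaces, and the specific loss forms (smooth-L1 for patch features, negative cosine similarity for the global feature) are assumptions for illustration only.

```python
# Hypothetical sketch of MimCo-style losses; interfaces and loss forms are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MimCoLoss(nn.Module):
    def __init__(self, student: nn.Module, teacher: nn.Module):
        super().__init__()
        self.student = student
        self.teacher = teacher
        # The contrastively pre-trained teacher stays frozen during MimCo pre-training.
        for p in self.teacher.parameters():
            p.requires_grad = False

    def forward(self, images: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        """images: (B, C, H, W); mask: (B, N) boolean, True marks a masked patch."""
        with torch.no_grad():
            # Assumed teacher interface: per-patch tokens (B, N, D) and a global feature (B, D).
            t_patches, t_global = self.teacher(images)
        # Assumed student interface: sees the masked image and returns the same shapes.
        s_patches, s_global = self.student(images, mask)

        # Patch-level reconstruction: match teacher patch features at masked positions.
        patch_loss = F.smooth_l1_loss(s_patches[mask], t_patches[mask])

        # Image-level reconstruction: align global features via negative cosine similarity.
        image_loss = 1 - F.cosine_similarity(s_global, t_global, dim=-1).mean()

        return patch_loss + image_loss
```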