In this work, we present Multi-Level Contrastive Learning for Dense Prediction Task (MCL), an efficient self-supervised method for learning region-level feature representations for dense prediction tasks. Our method is motivated by three key factors in detection: localization, scale consistency, and recognition. To explicitly encode absolute position and scale information, we propose a novel pretext task that assembles multi-scale images in a montage manner to mimic multi-object scenarios. Unlike existing image-level self-supervised methods, our method constructs a multi-level contrastive loss that treats each sub-region of the montage image as a singleton. Our method enables the neural network to learn regional semantic representations that are consistent under translation and scale changes, while reducing the number of pre-training epochs to that of supervised pre-training. Extensive experiments demonstrate that MCL consistently outperforms recent state-of-the-art methods on various datasets by significant margins. In particular, MCL obtains 42.5 AP$^\mathrm{bb}$ and 38.3 AP$^\mathrm{mk}$ on COCO with 1x-schedule fine-tuning, using Mask R-CNN with an R50-FPN backbone pre-trained for 100 epochs. Compared to MoCo, our method surpasses it by 4.0 AP$^\mathrm{bb}$ and 3.1 AP$^\mathrm{mk}$. Furthermore, we explore the alignment between the pretext task and downstream tasks. We extend our pretext task to supervised pre-training, which achieves performance similar to self-supervised learning. This result demonstrates the importance of aligning the pretext task with downstream tasks and indicates the potential for wider applicability of our method beyond self-supervised settings.
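To make the montage idea concrete, the sketch below illustrates one way such an input could be assembled: several source images are rescaled and tiled into a single grid image, and each cell keeps its own index so it can later act as an individual instance in a region-level contrastive loss. This is a minimal illustrative sketch, not the paper's implementation; the helper name `build_montage`, the 2x2 grid, and the 112-pixel cell size are assumptions.

```python
import torch
import torch.nn.functional as F


def build_montage(images, grid=2, cell_size=112):
    """Assemble grid x grid rescaled images into one montage image.

    images: tensor of shape (grid*grid, 3, H, W), one source image per cell.
    Returns the (3, grid*cell_size, grid*cell_size) montage and the list of
    per-cell indices, so each sub-region can be treated as its own instance.
    All names and sizes here are illustrative assumptions.
    """
    montage = torch.zeros(3, grid * cell_size, grid * cell_size)
    cell_ids = []
    for idx in range(grid * grid):
        r, c = divmod(idx, grid)
        # Resize each source image to the cell resolution; in practice the
        # inputs would be augmented views taken at different scales.
        cell = F.interpolate(
            images[idx:idx + 1], size=(cell_size, cell_size),
            mode="bilinear", align_corners=False,
        )[0]
        montage[:, r * cell_size:(r + 1) * cell_size,
                   c * cell_size:(c + 1) * cell_size] = cell
        cell_ids.append(idx)
    return montage, cell_ids
```

Under this reading, each cell index would serve as the instance label in the multi-level contrastive loss, so features pooled from a sub-region are pulled toward other views of the same source image and pushed away from the other cells.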