To date, most existing self-supervised learning methods are designed and optimized for image classification. These pre-trained models can be sub-optimal for dense prediction tasks because of the discrepancy between image-level prediction and pixel-level prediction. To fill this gap, we aim to design an effective, dense self-supervised learning method that works directly at the level of pixels (or local features) by taking into account the correspondence between local features. We present dense contrastive learning, which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images. Compared to the baseline method MoCo-v2, our method introduces negligible computation overhead (less than 1% slower), yet demonstrates consistently superior performance when transferring to downstream dense prediction tasks including object detection, semantic segmentation and instance segmentation, and it outperforms the state-of-the-art methods by a large margin. Specifically, over the strong MoCo-v2 baseline, our method achieves significant improvements of 2.0% AP on PASCAL VOC object detection, 1.1% AP on COCO object detection, 0.9% AP on COCO instance segmentation, 3.0% mIoU on PASCAL VOC semantic segmentation and 1.8% mIoU on Cityscapes semantic segmentation. Code is available at: https://git.io/AdelaiDet
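As a rough illustration of a pixel-level contrastive loss between two views, the following is a minimal sketch and not the released implementation. It assumes several details that the abstract does not state: an InfoNCE-style loss, a correspondence rule that pairs each local feature in one view with its most cosine-similar local feature in the other view, negatives supplied as a plain list of feature vectors, and a temperature of 0.2. The names `dense_contrastive_loss`, `view1`, `view2` and `negatives` are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def dense_contrastive_loss(view1, view2, negatives, temperature=0.2):
    """Average InfoNCE-style loss over the local features of view1.

    view1, view2: lists of local feature vectors, one per spatial location.
    negatives: list of feature vectors assumed to come from other images.
    temperature: assumed value, chosen only for illustration.
    """
    q = [normalize(v) for v in view1]
    k = [normalize(v) for v in view2]
    neg = [normalize(v) for v in negatives]

    total = 0.0
    for qi in q:
        # Assumed correspondence rule: the positive is the most similar
        # location in the other view (cosine similarity on unit vectors).
        pos = max(dot(qi, kj) for kj in k)
        logits = [pos / temperature] + [dot(qi, n) / temperature for n in neg]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the positive among all candidates.
        total += log_denominator - logits[0]
    return total / len(q)
```

Under these assumptions, two views whose local features match up (in any spatial order) give a loss close to zero, whereas a second view that resembles none of the first view's local features gives a larger loss.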