Contrastive self-supervised learning has outperformed supervised pretraining on many downstream tasks like segmentation and object detection. However, current methods are still primarily applied to curated datasets like ImageNet. In this paper, we first study how biases in the dataset affect existing methods. Our results show that current contrastive approaches work surprisingly well across: (i) object- versus scene-centric, (ii) uniform versus long-tailed and (iii) general versus domain-specific datasets. Second, given the generality of the approach, we try to realize further gains with minor modifications. We show that learning additional invariances -- through the use of multi-scale cropping, stronger augmentations and nearest neighbors -- improves the representations. Finally, we observe that MoCo learns spatially structured representations when trained with a multi-crop strategy. The representations can be used for semantic segment retrieval and video instance segmentation without finetuning. Moreover, the results are on par with specialized models. We hope this work will serve as a useful study for other researchers. The code and models are available at https://github.com/wvangansbeke/Revisiting-Contrastive-SSL.
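For illustration only, below is a minimal sketch of the multi-crop augmentation idea mentioned in the abstract (a few global views plus several smaller local crops of the same image), written with standard torchvision transforms. The crop sizes, scale ranges, jitter strengths, and number of local views are illustrative assumptions, not the exact recipe used in the paper or its released code.

```python
import torchvision.transforms as T

# Global views: large random crops with standard contrastive augmentations.
# Values below are illustrative, not the paper's exact settings.
global_transform = T.Compose([
    T.RandomResizedCrop(224, scale=(0.2, 1.0)),
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    T.RandomGrayscale(p=0.2),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

# Local views: smaller crops covering sub-regions of the image.
local_transform = T.Compose([
    T.RandomResizedCrop(96, scale=(0.05, 0.2)),
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    T.RandomGrayscale(p=0.2),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

def multi_crop(image, num_local=4):
    """Return two global views plus several local views of one PIL image."""
    views = [global_transform(image), global_transform(image)]
    views += [local_transform(image) for _ in range(num_local)]
    return views
```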