Self-supervised learning holds promise in leveraging large amounts of unlabeled data. However, its success heavily relies on highly curated datasets, e.g., ImageNet, which still require human cleaning. Learning representations directly from less-curated scene images is essential for pushing self-supervised learning to a higher level. Unlike curated images, which carry simple and clear semantic information, scene images are more complex and mosaic-like because they often contain complex scenes and multiple objects. Despite its feasibility, recent works have largely overlooked discovering the most discriminative regions for contrastive learning of object representations in scene images. In this work, we leverage the saliency map derived from the model's output during learning to highlight these discriminative regions and guide the whole contrastive learning process. Specifically, the saliency map first guides the method to crop discriminative regions as positive pairs, and then reweights the contrastive losses among different crops by their saliency scores. Our method significantly improves the performance of self-supervised learning on scene images, by +1.1, +4.3, and +2.2 Top-1 accuracy on ImageNet linear evaluation and semi-supervised learning with 1% and 10% of ImageNet labels, respectively. We hope our insights on saliency maps can motivate future research on more general-purpose unsupervised representation learning from scene data.
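The second step of the method, reweighting the per-crop contrastive losses by saliency scores, can be illustrated with a minimal NumPy sketch of a saliency-weighted InfoNCE loss. The function name, the normalization of saliency scores into weights, and the temperature value are our illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def saliency_weighted_infonce(z1, z2, saliency, temperature=0.1):
    """Hedged sketch: InfoNCE over two views of N crops, where each
    positive pair's loss is reweighted by its crop's saliency score.

    z1, z2: (N, D) embeddings of matched crops; saliency: (N,) scores.
    """
    # L2-normalize embeddings so dot products are cosine similarities.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                 # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    per_pair = -np.diag(log_prob)                    # positives on the diagonal
    weights = saliency / saliency.sum()              # assumed weighting scheme
    return float((weights * per_pair).sum())
```

With uniform saliency this reduces to the ordinary mean InfoNCE loss; non-uniform scores shift the training signal toward crops the saliency map marks as discriminative.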