Contrastive self-supervised learning has largely closed the gap to supervised pre-training on ImageNet. However, its success relies heavily on the object-centric priors of ImageNet, i.e., different augmented views of the same image correspond to the same object. This heavily curated constraint becomes immediately infeasible when pre-training on more complex scene images containing many objects. To overcome this limitation, we introduce Object-level Representation Learning (ORL), a new self-supervised learning framework for scene images. Our key insight is to leverage image-level self-supervised pre-training as a prior to discover object-level semantic correspondences, thus realizing object-level representation learning from scene images. Extensive experiments on COCO show that ORL significantly improves the performance of self-supervised learning on scene images, even surpassing supervised ImageNet pre-training on several downstream tasks. Furthermore, ORL's downstream performance improves as more unlabeled scene images become available, demonstrating its great potential for harnessing unlabeled data in the wild. We hope our approach can motivate future research on more general-purpose unsupervised representation learning from scene data.
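The key insight above can be illustrated with a minimal sketch: region features from an image-level pre-trained encoder are matched across two augmented views by nearest-neighbor similarity, and the matched pairs serve as positives for an object-level contrastive (InfoNCE-style) loss. The function names, the use of cosine similarity, and the temperature value are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cosine_sim(a, b):
    # Pairwise cosine similarity between two sets of feature vectors.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def match_objects(feats_a, feats_b):
    # Use features from an image-level pre-trained encoder (the prior) to
    # match each region in view A to its most similar region in view B.
    sim = cosine_sim(feats_a, feats_b)
    return sim.argmax(axis=1)

def object_level_infonce(feats_a, feats_b, matches, temperature=0.1):
    # Contrastive loss at the object level: the matched region in view B is
    # the positive, all other regions in view B act as negatives.
    logits = cosine_sim(feats_a, feats_b) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(matches)), matches].mean()
```

In a full pipeline the region features would come from object proposals on each view, but the matching-then-contrast structure is the same.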