We present a new framework for self-supervised representation learning that poses it as a ranking problem in an image retrieval context, using a large number of random views drawn from random sets of images. Our work is based on two intuitive observations: first, a good image representation must yield a high-quality ranking in a retrieval task; second, random views of an image should rank closer to a reference view of that image than random views of other images do. Hence, we model representation learning as a learning-to-rank problem and train the model by maximizing average precision (AP) for ranking. Specifically, given a mini-batch of images, we generate a large number of positive/negative samples and compute a ranking loss term by treating each image view, in turn, as a retrieval query. The new framework, dubbed S2R2, optimizes a global objective, in contrast to the local objective of the popular contrastive learning framework, which is computed on pairs of views. This global objective leads S2R2 to converge in fewer epochs. In principle, using a ranking criterion eliminates the reliance on object-centered curated datasets (e.g., ImageNet). When trained on STL10 and MS-COCO, S2R2 outperforms SimCLR and performs on par with the state-of-the-art clustering-based contrastive learning model, SwAV, while being much simpler both conceptually and in implementation. Furthermore, when trained on a small subset of MS-COCO that contains fewer similar scenes, S2R2 significantly outperforms both SwAV and SimCLR. This indicates that S2R2 is potentially more effective on diverse scenes and reduces the need for a large training dataset in self-supervised learning.
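To make the ranking view concrete, the following is a minimal sketch of the quantity being maximized: each view in a mini-batch is treated as a query, all other views are ranked by similarity to it, and AP is computed with views of the same image counted as positives. This is the plain, non-differentiable AP metric, not the training loss itself; the abstract does not say how AP is made differentiable, and the function names, the cosine similarity, and the toy embeddings are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def average_precision(query_idx, embeddings, image_ids):
    """AP for one view used as a retrieval query against all other views."""
    query = embeddings[query_idx]
    # Rank every other view by decreasing similarity to the query.
    candidates = [i for i in range(len(embeddings)) if i != query_idx]
    candidates.sort(key=lambda i: cosine(query, embeddings[i]), reverse=True)
    hits = 0
    precisions = []
    for rank, i in enumerate(candidates, start=1):
        # A candidate is a positive if it is a view of the same source image.
        if image_ids[i] == image_ids[query_idx]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_ap(embeddings, image_ids):
    """Average the per-query AP over every view in the mini-batch."""
    n = len(embeddings)
    return sum(average_precision(q, embeddings, image_ids) for q in range(n)) / n

# Toy mini-batch: two images, two views each (hypothetical 2-D embeddings).
views = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
good = mean_ap(views, [0, 0, 1, 1])  # views of the same image lie close together
bad = mean_ap(views, [0, 1, 0, 1])   # views of the same image lie far apart
print(good, bad)
```

In the toy batch, the first labelling places both views of each image next to each other, so every query ranks its positive first; the second labelling scatters them and yields a lower score. A trainable version would need a smooth surrogate for the ranking step, which this sketch does not attempt.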