Self-supervised learning has been widely used to obtain transferable representations from unlabeled images. In particular, recent contrastive learning methods have shown impressive performance on downstream image classification tasks. While these contrastive methods mainly focus on generating image-level global representations that are invariant under semantic-preserving transformations, they are prone to overlooking the spatial consistency of local representations, which limits their usefulness as pretraining for localization tasks such as object detection and instance segmentation. Moreover, the aggressively cropped views used in existing contrastive methods can minimize representation distances between semantically different regions of a single image. In this paper, we propose a spatially consistent representation learning algorithm (SCRL) for multi-object and location-specific tasks. Specifically, we devise a novel self-supervised objective that encourages coherent spatial representations of a randomly cropped local region under geometric translations and zooming operations. On various downstream localization tasks with benchmark datasets, the proposed SCRL shows significant performance improvements over both image-level supervised pretraining and state-of-the-art self-supervised learning methods.
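A minimal PyTorch sketch of the core idea, under stated assumptions: features of the same physical region are pooled from two differently cropped views with RoIAlign, and the two pooled representations are pulled together. The toy encoder, box coordinates, feature stride, and simple negative-cosine loss below are illustrative placeholders (the full method additionally uses an online/target network pair, BYOL-style), not the authors' released code.

```python
# Hedged sketch of spatially consistent representation learning: the boxes map
# one shared region into each view's own coordinate frame (translation + zoom),
# and the loss makes the pooled local representations agree across the views.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_align

class TinyEncoder(nn.Module):
    """Stand-in backbone producing a spatial feature map (C=64, stride 8)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2, padding=1),
        )

    def forward(self, x):
        return self.net(x)

def spatial_consistency_loss(feat_a, feat_b, boxes_a, boxes_b, stride=8):
    """Pool the *same* image region from both views and minimize the
    distance between the pooled vectors (negative cosine similarity).
    boxes_* are (N, 5) tensors: (batch_idx, x1, y1, x2, y2) in view pixels."""
    pooled_a = roi_align(feat_a, boxes_a, output_size=1, spatial_scale=1.0 / stride)
    pooled_b = roi_align(feat_b, boxes_b, output_size=1, spatial_scale=1.0 / stride)
    za = F.normalize(pooled_a.flatten(1), dim=1)
    zb = F.normalize(pooled_b.flatten(1), dim=1)
    return -(za * zb).sum(dim=1).mean()

encoder = TinyEncoder()
view_a, view_b = torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224)
# Hypothetical boxes: one shared region per image, expressed in each view's
# coordinates after that view's own random crop/resize.
boxes_a = torch.tensor([[0, 32., 32., 128., 128.], [1, 16., 48., 96., 160.]])
boxes_b = torch.tensor([[0, 8., 8., 200., 200.], [1, 40., 20., 180., 140.]])
loss = spatial_consistency_loss(encoder(view_a), encoder(view_b), boxes_a, boxes_b)
print(loss.item())
```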