A core component of the recent success of self-supervised learning is cropping data augmentation, which selects sub-regions of an image to be used as positive views in the self-supervised loss. The underlying assumption is that randomly cropped and resized regions of a given image share information about the objects of interest, which the learned representation will capture. This assumption is mostly satisfied in datasets such as ImageNet, where there is a large, centered object that is highly likely to be present in random crops of the full image. However, in other datasets such as OpenImages or COCO, which are more representative of real-world uncurated data, there are typically multiple small objects in an image. In this work, we show that self-supervised learning based on the usual random cropping performs poorly on such datasets. We propose replacing one or both of the random crops with crops obtained from an object proposal algorithm. This encourages the model to learn both object- and scene-level semantic representations. This approach, which we call object-aware cropping, yields significant improvements over scene cropping on classification and object detection benchmarks. For example, on OpenImages, our approach achieves an improvement of 8.8% mAP over random scene-level cropping using MoCo-v2-based pre-training. We also show significant improvements on COCO and PASCAL-VOC object detection and segmentation tasks over state-of-the-art self-supervised learning approaches. Our approach is efficient, simple, and general, and can be used in most existing contrastive and non-contrastive self-supervised learning frameworks.
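To make the idea concrete, here is a minimal Python sketch of how one of the two random crops in a two-view pipeline (e.g., MoCo-v2) could be replaced by an object-proposal crop. The `get_object_proposals` helper is a hypothetical placeholder for any unsupervised proposal method (e.g., selective search); this is an illustrative sketch under those assumptions, not the paper's implementation.

```python
# Minimal sketch of object-aware cropping for a two-view SSL pipeline.
# `get_object_proposals` is a placeholder (an assumption, not the paper's
# method): plug in selective search or any unsupervised proposal algorithm.

import random
from PIL import Image
import torchvision.transforms as T


def get_object_proposals(img: Image.Image) -> list[tuple[int, int, int, int]]:
    """Return candidate object boxes as (left, top, width, height).

    Hypothetical stub: replace with a real unsupervised proposal method.
    """
    raise NotImplementedError


# Standard scene-level random crop, as in common SSL augmentation stacks.
scene_crop = T.RandomResizedCrop(224, scale=(0.2, 1.0))
post = T.Compose([T.RandomHorizontalFlip(), T.ToTensor()])


def object_aware_views(img: Image.Image):
    """Produce two positive views: one scene-level random crop and one
    crop taken from a randomly chosen object proposal. (The paper also
    considers replacing both crops with proposal-based crops.)"""
    left, top, w, h = random.choice(get_object_proposals(img))
    object_view = img.crop((left, top, left + w, top + h)).resize((224, 224))
    scene_view = scene_crop(img)
    return post(scene_view), post(object_view)
```

The two returned tensors would then be fed to the contrastive (or non-contrastive) loss exactly as the usual pair of random crops would be.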