Many recent approaches in contrastive learning have worked to close the gap between pretraining on iconic images, such as those in ImageNet, and pretraining on complex scenes, such as those in COCO. This gap exists largely because the commonly used random-crop augmentation yields semantically inconsistent content in crowded scene images containing diverse objects. Previous works use preprocessing pipelines to localize salient objects for improved cropping, but an end-to-end solution remains elusive. In this work, we propose a framework that accomplishes this goal via joint learning of representations and segmentation. We leverage segmentation masks to train a model with a mask-dependent contrastive loss, and use the partially trained model to bootstrap better masks. By iterating between these two components, we ground the contrastive updates in segmentation information and simultaneously improve segmentation throughout pretraining. Experiments show that our representations transfer robustly to downstream tasks in classification, detection, and segmentation.
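To make the alternating scheme concrete, below is a minimal sketch of one plausible realization, assuming a PyTorch-style setup with an encoder that produces dense per-pixel feature maps. All names (`mask_pool`, `mask_contrastive_loss`, `train_step`, the temperature `tau`) are hypothetical, and the sketch assumes segment masks are index-aligned across the two augmented views; it illustrates a mask-dependent contrastive loss of the kind described above, not the authors' implementation.

```python
# Hypothetical sketch of mask-dependent contrastive pretraining.
# Not the paper's code: names, shapes, and the mask-alignment
# assumption (segment k in view 1 matches segment k in view 2)
# are illustrative choices.
import torch
import torch.nn.functional as F


def mask_pool(features, masks):
    """Average-pool per-pixel features within each segmentation mask.

    features: (B, C, H, W) dense feature map from the encoder
    masks:    (B, K, H, W) binary masks, one channel per segment
    returns:  (B, K, C) one pooled embedding per segment
    """
    masks = masks.float()
    area = masks.sum(dim=(2, 3)).clamp(min=1.0)            # (B, K)
    pooled = torch.einsum('bchw,bkhw->bkc', features, masks)
    return pooled / area.unsqueeze(-1)


def mask_contrastive_loss(z1, z2, tau=0.1):
    """InfoNCE-style loss over segment embeddings: the same segment
    seen in two views is a positive pair; every other segment in the
    batch serves as a negative.

    z1, z2: (B, K, C) segment embeddings from the two augmented views
    """
    z1 = F.normalize(z1.flatten(0, 1), dim=-1)             # (B*K, C)
    z2 = F.normalize(z2.flatten(0, 1), dim=-1)
    logits = z1 @ z2.t() / tau                             # (B*K, B*K)
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)


def train_step(encoder, view1, view2, masks1, masks2, optimizer):
    """One contrastive update grounded in the current masks. In the
    bootstrap loop, the masks themselves would be periodically
    refreshed by grouping the partially trained encoder's features
    (e.g., by clustering), then fed back into this step."""
    f1, f2 = encoder(view1), encoder(view2)                # dense features
    loss = mask_contrastive_loss(mask_pool(f1, masks1),
                                 mask_pool(f2, masks2))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Under these assumptions, iterating `train_step` with periodically refreshed masks realizes the two alternating components the abstract describes: the masks ground the contrastive updates, and the improving features in turn yield better masks.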