Recent progress in contrastive learning has revolutionized unsupervised representation learning. Concretely, multiple views (augmentations) of the same image are encouraged to map to similar embeddings, while views from different images are pushed apart. In this paper, by visualizing and diagnosing classification errors, we observe that current contrastive models are ineffective at localizing the foreground object, limiting their ability to extract discriminative high-level features. This is because the view generation process treats all pixels in an image uniformly. To address this problem, we propose a data-driven approach for learning invariance to backgrounds. It first estimates foreground saliency in images and then creates augmentations by copy-and-pasting the foreground onto a variety of backgrounds. Learning still follows the instance discrimination pretext task, so the representation is trained to disregard background content and focus on the foreground. We study a variety of saliency estimation methods and find that most of them lead to improvements for contrastive learning. With this approach (DiLo), significant performance gains are achieved for self-supervised learning on ImageNet classification, and also for object detection on PASCAL VOC and MSCOCO.
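The core augmentation described above can be sketched as a simple saliency-masked composite. The sketch below is illustrative only, assuming dense images and a precomputed saliency map in [0, 1]; the function name, the hard threshold, and the blending choice are our assumptions, not the paper's exact implementation.

```python
import numpy as np

def copy_paste_augment(image, saliency, background, threshold=0.5):
    """Composite the salient foreground of `image` onto `background`.

    image, background: HxWx3 float arrays of the same shape.
    saliency: HxW array with values in [0, 1].
    Hypothetical helper: the paper's saliency estimators and
    blending details may differ (e.g. soft alpha matting).
    """
    # Binarize saliency into a foreground mask, broadcast over channels.
    mask = (saliency > threshold).astype(image.dtype)[..., None]
    # Keep foreground pixels from the source, fill the rest from background.
    return mask * image + (1.0 - mask) * background
```

The resulting composite is used as one "view" in the instance discrimination objective, so matching it against an ordinary crop of the same image forces the embedding to ignore the swapped-in background.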