Unsupervised representation learning is an important challenge in computer vision, with self-supervised learning methods recently closing the gap to supervised representation learning. An important ingredient in high-performing self-supervised methods is the use of data augmentation: models are trained to place different augmented views of the same image nearby in embedding space. However, commonly used augmentation pipelines treat images holistically, disregarding the semantic relevance of parts of an image (e.g., a subject vs. a background), which can lead to the learning of spurious correlations. Our work addresses this problem by investigating a class of simple, yet highly effective "background augmentations", which encourage models to focus on semantically relevant content by discouraging them from focusing on image backgrounds. Background augmentations lead to substantial improvements in performance (+1-2% on ImageNet-1k) across a spectrum of state-of-the-art self-supervised methods (MoCo-v2, BYOL, SwAV) on a variety of tasks, allowing us to reach within 0.3% of supervised performance. We also demonstrate that background augmentations improve robustness in a number of out-of-distribution settings, including natural adversarial examples, the Backgrounds Challenge, adversarial attacks, and ReaL ImageNet.
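The abstract itself contains no implementation detail, but the core mechanism it describes, compositing an image's foreground onto an unrelated background so that positive views share only foreground content, can be sketched in a few lines. Below is a minimal, hypothetical illustration in PyTorch; it assumes a precomputed binary foreground mask (e.g., from an off-the-shelf saliency or segmentation model), and the function and variable names are ours, not the authors'.

```python
import torch

def background_swap(image: torch.Tensor,
                    fg_mask: torch.Tensor,
                    background: torch.Tensor) -> torch.Tensor:
    """Composite the foreground of `image` onto a different `background`.

    image, background: float tensors of shape (3, H, W) in [0, 1]
    fg_mask: float tensor of shape (1, H, W); 1 = foreground, 0 = background
    (Hypothetical sketch; not the authors' released code.)
    """
    return fg_mask * image + (1.0 - fg_mask) * background

# Toy usage: two "positive" views of the same image that share only the
# foreground, so matching them in embedding space cannot be achieved by
# relying on background features.
image = torch.rand(3, 224, 224)
fg_mask = (torch.rand(1, 224, 224) > 0.5).float()  # stand-in for a real mask
view1 = background_swap(image, fg_mask, torch.rand(3, 224, 224))
view2 = background_swap(image, fg_mask, torch.rand(3, 224, 224))
```

Under this kind of augmentation, the self-supervised objective (contrastive or self-distillation) is only satisfiable through foreground content, which is the intuition behind the robustness gains the abstract reports.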