Recent self-supervised contrastive methods have been able to produce impressive transferable visual representations by learning to be invariant to different data augmentations. However, these methods implicitly assume a particular set of representational invariances (e.g., invariance to color), and can perform poorly when a downstream task violates this assumption (e.g., distinguishing red vs. yellow cars). We introduce a contrastive learning framework which does not require prior knowledge of specific, task-dependent invariances. Our model learns to capture varying and invariant factors for visual representations by constructing separate embedding spaces, each of which is invariant to all but one augmentation. We use a multi-head network with a shared backbone which captures information across each augmentation and alone outperforms all baselines on downstream tasks. We further find that the concatenation of the invariant and varying spaces performs best across all tasks we investigate, including coarse-grained, fine-grained, and few-shot downstream classification tasks, and various data corruptions.
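To make the described multi-head design concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: a shared backbone feeds one fully invariant projection head plus one head per augmentation, and the resulting embedding spaces can be concatenated for downstream tasks. The augmentation names, backbone choice, and head dimensions are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the authors' code) of a shared backbone with
# one "all-invariant" projection head plus one head per augmentation, each head
# defining its own embedding space.
import torch
import torch.nn as nn
import torchvision.models as models


class MultiHeadContrastiveModel(nn.Module):
    def __init__(self, augmentations=("color", "rotation", "texture"), proj_dim=128):
        super().__init__()
        # Shared backbone: ResNet-50 with its classification layer removed.
        backbone = models.resnet50(weights=None)
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()
        self.backbone = backbone

        def make_head():
            return nn.Sequential(
                nn.Linear(feat_dim, feat_dim),
                nn.ReLU(inplace=True),
                nn.Linear(feat_dim, proj_dim),
            )

        # One head for the fully invariant space, plus one head per augmentation
        # (each intended to remain sensitive to that augmentation).
        self.heads = nn.ModuleDict({"all_invariant": make_head()})
        self.heads.update({name: make_head() for name in augmentations})

    def forward(self, x):
        h = self.backbone(x)  # shared representation
        # Each head maps the shared feature into its own normalized embedding space.
        return {name: nn.functional.normalize(head(h), dim=-1)
                for name, head in self.heads.items()}


# Usage: embeddings from all spaces can be concatenated for downstream evaluation.
model = MultiHeadContrastiveModel()
images = torch.randn(4, 3, 224, 224)
spaces = model(images)
downstream_features = torch.cat(list(spaces.values()), dim=-1)
print(downstream_features.shape)  # torch.Size([4, 512]) with 4 heads x 128 dims
```

In this sketch, each augmentation-specific head would be trained with positives that share the corresponding augmentation parameters (so the space stays invariant to all other augmentations), while the `all_invariant` head follows standard contrastive training; the contrastive loss itself is omitted here.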