Contrastive representation learning has proven to be an effective self-supervised learning method for images and videos. Most successful approaches are based on Noise Contrastive Estimation (NCE) and use different views of an instance as positives that are contrasted with other instances, called negatives, which are treated as noise. However, several instances in a dataset are drawn from the same distribution and share underlying semantic information. A good data representation should capture the relations between instances, i.e., their semantic similarities and dissimilarities, which contrastive learning harms by treating all negatives as noise. To circumvent this issue, we propose a novel formulation of contrastive learning using semantic similarity between instances, called Similarity Contrastive Estimation (SCE). Our training objective is a soft contrastive one that brings positives closer and estimates a continuous distribution to push or pull negative instances based on their learned similarities. We empirically validate our approach on both image and video representation learning. We show that SCE performs competitively with the state of the art on the ImageNet linear evaluation protocol with fewer pretraining epochs, and that it generalizes to several downstream image tasks. We also show that SCE reaches state-of-the-art results for video representation pretraining and that the learned representations generalize to video downstream tasks.
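To make the soft contrastive objective concrete, the following is a minimal PyTorch sketch of a loss of this kind: the target for each instance mixes the one-hot positive with a similarity distribution over the other instances, so negatives are pushed or pulled according to their estimated similarity rather than uniformly repelled as noise. The function name sce_loss, the mixture coefficient lam, the temperatures tau and tau_t, and the online/target two-branch setup are illustrative assumptions, not the paper's reference implementation.

```python
# Minimal, illustrative sketch of a soft contrastive (SCE-style) objective.
# All names and hyperparameter values here are assumptions for illustration.
import torch
import torch.nn.functional as F

def sce_loss(z_online, z_target, lam=0.5, tau=0.1, tau_t=0.07):
    """Soft contrastive loss between two batches of embeddings.

    z_online: (N, D) embeddings of view 1 from the online encoder.
    z_target: (N, D) embeddings of view 2 from a stop-gradient target encoder.
    """
    z_online = F.normalize(z_online, dim=1)
    z_target = F.normalize(z_target, dim=1).detach()
    n = z_online.size(0)

    # Similarity distribution among target embeddings (self-similarity masked
    # out): it estimates relations between instances instead of treating all
    # other instances as noise.
    sim_t = z_target @ z_target.t() / tau_t
    sim_t.fill_diagonal_(float('-inf'))
    s = F.softmax(sim_t, dim=1)

    # Soft target: a mixture of the one-hot positive and the learned
    # similarities, controlled by the mixture coefficient lam.
    w = lam * torch.eye(n, device=z_online.device) + (1.0 - lam) * s

    # Predicted contrastive distribution of each online embedding over the
    # target embeddings.
    log_p = F.log_softmax(z_online @ z_target.t() / tau, dim=1)

    # Cross-entropy between the soft target and the predicted distribution.
    return -(w * log_p).sum(dim=1).mean()
```

With lam = 1 this sketch reduces to a standard NCE-style contrastive loss with hard one-hot targets; lam < 1 lets the learned inter-instance similarities reshape how each negative is pushed or pulled.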