Recently, self-supervised learning has attracted great attention, since it requires only unlabeled data for model training. Contrastive learning is one popular method for self-supervised learning and has achieved promising empirical performance. However, the theoretical understanding of its generalization ability is still limited. To this end, we define a $(\sigma,\delta)$-measure to mathematically quantify the data augmentation, and then provide an upper bound on the downstream classification error rate based on this measure. The bound reveals that the generalization ability of contrastive self-supervised learning is governed by three key factors: alignment of positive samples, divergence of class centers, and concentration of augmented data. The first two factors are properties of the learned representations, while the third is determined by the pre-defined data augmentation. We further investigate two canonical contrastive losses, InfoNCE and cross-correlation, and show how they provably achieve the first two factors. Moreover, we conduct experiments to study the third factor, and observe a strong correlation between downstream performance and the concentration of augmented data.
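For concreteness, the two losses referenced above are commonly written in the following standard forms (a sketch only; the temperature $\tau$, trade-off weight $\lambda$, and exact normalization are conventions not fixed by this abstract). Here $f$ denotes the encoder, $(x, x^{+})$ a positive pair of augmented views, $\{x_i^{-}\}$ the negative samples, and $C$ the cross-correlation matrix between the normalized embeddings of two augmented batches:
\[
\mathcal{L}_{\mathrm{InfoNCE}}
= -\,\mathbb{E}\!\left[\log
\frac{\exp\!\big(f(x)^{\top} f(x^{+})/\tau\big)}
{\exp\!\big(f(x)^{\top} f(x^{+})/\tau\big) + \sum_{i}\exp\!\big(f(x)^{\top} f(x_i^{-})/\tau\big)}\right],
\qquad
\mathcal{L}_{\mathrm{CC}}
= \sum_{i}\big(1 - C_{ii}\big)^{2} + \lambda \sum_{i}\sum_{j \neq i} C_{ij}^{2}.
\]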