Contrastive learning aims to extract distinctive features from data by finding an embedding representation in which similar samples are close to each other and different ones are far apart. We study how NNs generalize the concept of similarity in the presence of noise, investigating two phenomena: Double Descent (DD) behavior and online/offline correspondence. While DD examines how the network adapts to the dataset over long training times or as the number of parameters increases, online/offline correspondence compares network performance as the quality (diversity) of the dataset varies. We focus on the simplest representative of contrastive learning: Siamese Neural Networks (SNNs). We point out that SNNs can be affected by two distinct sources of noise: Pair Label Noise (PLN) and Single Label Noise (SLN). The effect of SLN is asymmetric but preserves similarity relations, while PLN is symmetric but breaks transitivity. We find that DD also appears in SNNs and is exacerbated by noise. We show that the dataset topology crucially affects generalization. While sparse datasets show the same performance under SLN and PLN for an equal amount of noise, SLN outperforms PLN in the overparametrized region for dense datasets. Indeed, in this regime, PLN similarity violation becomes macroscopic, corrupting the dataset to the point where complete overfitting cannot be achieved. We call this phenomenon Density-Induced Break of Similarity (DIBS). Probing the equivalence between online optimization and offline generalization in SNNs, we find that their correspondence breaks down in the presence of label noise in all the scenarios considered.
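The distinction between the two noise models can be made concrete with a minimal sketch (helper names are hypothetical, not from the paper): SLN corrupts individual class labels before pair labels are derived, so similarity relations stay transitive, whereas PLN flips pair labels directly and can therefore break transitivity (a~b and b~c without a~c).

```python
import random

def pair_label(labels, i, j):
    # Similarity label for a pair: 1 if the two items share a class, else 0.
    return int(labels[i] == labels[j])

def apply_sln(labels, p, rng):
    # Single Label Noise (SLN): with probability p, replace an item's class
    # label with a different class. Pair labels derived afterwards are still
    # induced by an equivalence relation, so transitivity is preserved.
    classes = sorted(set(labels))
    noisy = []
    for y in labels:
        if rng.random() < p:
            noisy.append(rng.choice([c for c in classes if c != y]))
        else:
            noisy.append(y)
    return noisy

def apply_pln(pair_labels, p, rng):
    # Pair Label Noise (PLN): flip each pair's similarity label with
    # probability p. Flipped pairs need not be consistent with any
    # underlying class assignment, so transitivity can be violated.
    return [1 - y if rng.random() < p else y for y in pair_labels]
```

In this sketch, the same nominal noise rate p produces structurally different corruption: SLN noise is still "explainable" by some labeling of single items, while PLN noise acts on edges of the similarity graph independently.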