Rethinking InfoNCE: How Many Negative Samples Do You Need?

InfoNCE is a widely used loss function for contrastive model training. It aims to estimate the mutual information between a pair of variables by discriminating between each positive pair and its $K$ associated negative pairs. It has been proved that, when the sample labels are clean, the lower bound on the mutual information estimate becomes tighter as more negative samples are incorporated, which usually yields better model performance. However, in many real-world tasks the labels contain noise, and incorporating too many noisy negative samples into training may be suboptimal. In this paper, we study how many negative samples are optimal for InfoNCE in different scenarios via a semi-quantitative theoretical framework. More specifically, we first propose a probabilistic model to analyze the influence of the negative sampling ratio $K$ on the informativeness of training samples. Then, we design a training effectiveness function that measures the overall influence of training samples on model learning based on their informativeness. We estimate the optimal negative sampling ratio as the value of $K$ that maximizes the training effectiveness function. Based on our framework, we further propose an adaptive negative sampling method that dynamically adjusts the negative sampling ratio to improve InfoNCE-based model training. Extensive experiments on different real-world datasets show that our framework can accurately predict the optimal negative sampling ratio in different tasks, and that our proposed adaptive negative sampling method achieves better performance than the commonly used strategy of a fixed negative sampling ratio.
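To make the role of $K$ concrete, the following is a minimal sketch of the standard InfoNCE loss for one positive pair scored against $K$ negatives, together with the familiar bound $I \ge \log(K+1) - \mathcal{L}$. It is an illustration only, not the paper's probabilistic model or its training effectiveness function; the function names `info_nce_loss` and `mi_lower_bound` and the use of raw similarity scores as inputs are assumptions made for this example.

```python
import numpy as np


def info_nce_loss(pos_scores, neg_scores):
    """Mean InfoNCE loss over a batch (hypothetical helper for illustration).

    pos_scores: shape (B,), score of each positive pair.
    neg_scores: shape (B, K), scores of the K negative pairs per positive.
    """
    pos_scores = np.asarray(pos_scores, dtype=float)
    neg_scores = np.asarray(neg_scores, dtype=float)
    # Put the positive in column 0, followed by its K negatives: (B, K + 1).
    logits = np.concatenate([pos_scores[:, None], neg_scores], axis=1)
    # Numerically stable log-sum-exp over the K + 1 candidates.
    row_max = logits.max(axis=1, keepdims=True)
    log_sum_exp = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))
    # Negative log-probability of selecting the positive among K + 1 candidates.
    return float(np.mean(log_sum_exp - pos_scores))


def mi_lower_bound(loss, k):
    """Standard InfoNCE bound: I >= log(K + 1) - loss (assumes clean labels)."""
    return float(np.log(k + 1) - loss)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for k in (1, 4, 16, 64):
        pos = rng.normal(loc=2.0, size=256)        # synthetic positive scores
        neg = rng.normal(loc=0.0, size=(256, k))   # synthetic negative scores
        loss = info_nce_loss(pos, neg)
        print(k, round(loss, 4), round(mi_lower_bound(loss, k), 4))
```

Because the loss is non-negative, the bound can never exceed $\log(K+1)$, which is one way to see why a larger $K$ permits a tighter estimate when labels are clean. The sketch says nothing about noisy labels, which is the case the paper analyzes.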
