Knowledge of the label noise transition matrix, which characterizes the probability that a training instance is wrongly annotated, is crucial to designing popular solutions to learning with noisy labels, including loss correction and loss reweighting approaches. Existing works rely heavily on the existence of "anchor points" or their approximations, defined as instances that belong to a particular class almost surely. Nonetheless, finding anchor points remains a non-trivial task, and the estimation accuracy is often limited by the number of available anchor points. In this paper, we propose an alternative approach to this estimation task. Our main contribution is an efficient estimation procedure based on a clusterability condition. We prove that, with clusterable representations of features, using up to third-order consensuses of noisy labels among neighboring representations is sufficient to estimate a unique transition matrix. Compared with methods using anchor points, our approach uses substantially more instances and benefits from a much better sample complexity. We demonstrate the estimation accuracy and advantages of our estimates using both synthetic noisy labels (on CIFAR-10/100) and real human-level noisy labels (on Clothing1M and our self-collected human-annotated CIFAR-10).
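To make the idea of label consensus concrete, the following is a minimal sketch of the forward direction only: given a transition matrix and a clean class prior, it computes the first-, second-, and third-order agreement statistics of noisy labels and checks them against a simulation. It assumes that an instance and its two nearest neighbors share the same true label (one reading of the clusterability condition) and that their noisy labels are drawn independently given that label. The function names, the example matrix `T`, and the prior `p` are illustrative and not taken from the paper; the paper's estimator addresses the inverse problem (recovering the matrix from such statistics), which is not shown here, and its exact consensus definitions may differ.

```python
import numpy as np


def expected_consensus(T, p):
    """Consensus statistics implied by T (rows: true class, cols: noisy class)
    and clean prior p, assuming three noisy labels that are conditionally
    independent given one shared true label."""
    c1 = np.einsum("i,ij->j", p, T)                   # P(y1 = j)
    c2 = np.einsum("i,ij,ik->jk", p, T, T)            # P(y1 = j, y2 = k)
    c3 = np.einsum("i,ij,ik,il->jkl", p, T, T, T)     # P(y1 = j, y2 = k, y3 = l)
    return c1, c2, c3


def simulate_triplets(T, p, n, rng):
    """Draw n true labels from p, then three independent noisy labels each."""
    K = len(p)
    y = rng.choice(K, size=n, p=p)
    cdf = np.cumsum(T, axis=1)                        # per-row CDF of noisy labels
    u = rng.random((n, 3))
    # Inverse-CDF sampling: count how many CDF entries lie below each uniform draw.
    noisy = (u[:, :, None] > cdf[y][:, None, :]).sum(axis=2)
    return np.minimum(noisy, K - 1)                   # guard against float round-off


def empirical_consensus(noisy, K):
    """Frequencies of label patterns over (instance, neighbor 1, neighbor 2)."""
    n = noisy.shape[0]
    a, b, c = noisy[:, 0], noisy[:, 1], noisy[:, 2]
    c1 = np.bincount(a, minlength=K) / n
    c2 = np.zeros((K, K))
    np.add.at(c2, (a, b), 1.0)
    c3 = np.zeros((K, K, K))
    np.add.at(c3, (a, b, c), 1.0)
    return c1, c2 / n, c3 / n


# Hypothetical 3-class example: 20% of labels flip, split evenly between the other classes.
T = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
p = np.array([0.5, 0.3, 0.2])

rng = np.random.default_rng(0)
noisy = simulate_triplets(T, p, n=100_000, rng=rng)

exact = expected_consensus(T, p)
approx = empirical_consensus(noisy, K=3)
max_gap = max(np.abs(e - a).max() for e, a in zip(exact, approx))
print("largest gap between exact and empirical consensus:", max_gap)
```

Because every triplet contributes to the statistics, the empirical frequencies use all instances rather than a handful of anchor points, which is the intuition behind the sample-complexity claim in the abstract.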