The label noise transition matrix, which characterizes the probability that a training instance is wrongly annotated, is crucial to designing popular solutions for learning with noisy labels. Existing works rely heavily on finding "anchor points" or their approximations, defined as instances that belong to a particular class almost surely. Nonetheless, finding anchor points remains a non-trivial task, and estimation accuracy is often limited by the number of available anchor points. In this paper, we propose an alternative approach to this estimation task. Our main contribution is the discovery of an efficient estimation procedure based on a clusterability condition. We prove that with clusterable representations of features, using up to third-order consensuses of noisy labels among neighbor representations is sufficient to estimate a unique transition matrix. Compared with methods using anchor points, our approach uses substantially more instances and benefits from a much better sample complexity. We demonstrate the estimation accuracy and advantages of our estimates using both synthetic noisy labels (on CIFAR-10/100) and real human-level noisy labels (on Clothing1M and our self-collected human-annotated CIFAR-10). Our code and human-level noisy CIFAR-10 labels are available at https://github.com/UCSC-REAL/HOC.
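To make the notion of label consensuses concrete, the following is a minimal sketch, not the authors' released implementation. It assumes an idealized clusterability setting in which each instance and its two nearest neighbors share one true label and receive independently drawn noisy labels. Under that assumption, the first-, second-, and third-order co-occurrence frequencies of noisy labels are fixed functions of the class prior `p` and the transition matrix `T`. The function names, the example `T` and `p`, and the sample size are all hypothetical choices made for illustration; the step of solving for `T` from these statistics is omitted.

```python
import numpy as np

def simulate_noisy_triplets(T, p, n, rng):
    """Sample n (instance, neighbour, neighbour) noisy-label triplets.

    Assumption: all three members of a triplet share the same true label,
    and their noisy labels are drawn independently from row T[y].
    """
    K = len(p)
    y = rng.choice(K, size=n, p=p)            # shared true labels
    cdf = np.cumsum(T, axis=1)                # row-wise CDF of T
    u = rng.random((n, 3))                    # one uniform draw per member
    # Inverse-CDF sampling: count CDF entries strictly below each draw.
    noisy = (u[:, :, None] > cdf[y][:, None, :]).sum(axis=2)
    return np.minimum(noisy, K - 1)           # guard against rounding in cdf

def empirical_consensus(noisy, K):
    """Empirical 1st/2nd/3rd-order co-occurrence frequencies of noisy labels."""
    n = noisy.shape[0]
    c1 = np.bincount(noisy[:, 0], minlength=K) / n
    c2 = np.zeros((K, K))
    np.add.at(c2, (noisy[:, 0], noisy[:, 1]), 1.0)
    c3 = np.zeros((K, K, K))
    np.add.at(c3, (noisy[:, 0], noisy[:, 1], noisy[:, 2]), 1.0)
    return c1, c2 / n, c3 / n

def analytic_consensus(T, p):
    """Consensus values implied by (T, p) under the shared-true-label assumption."""
    c1 = np.einsum("k,ki->i", p, T)
    c2 = np.einsum("k,ki,kj->ij", p, T, T)
    c3 = np.einsum("k,ki,kj,kl->ijl", p, T, T, T)
    return c1, c2, c3

# Hypothetical 3-class example: rows of T sum to 1, T[k, i] = P(noisy=i | true=k).
T = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.7, 0.1],
              [0.1, 0.2, 0.7]])
p = np.array([0.5, 0.3, 0.2])

rng = np.random.default_rng(0)
noisy = simulate_noisy_triplets(T, p, n=200_000, rng=rng)
emp = empirical_consensus(noisy, K=3)
ana = analytic_consensus(T, p)

# Largest absolute gap between empirical and analytic values, per order.
gaps = [float(np.abs(e - a).max()) for e, a in zip(emp, ana)]
print(gaps)
```

In this sketch the empirical frequencies are computed from noisy labels alone, which is what makes them usable as estimation targets: an estimator would search for the `T` and `p` whose analytic consensuses match the empirical ones.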