Label noise is common in real-world large-scale datasets. It arises for a variety of reasons and is typically heterogeneous and feature-dependent. Most existing approaches to handling noisy labels fall into two categories: they either assume an idealized feature-independent noise model or remain heuristic, without theoretical guarantees. In this paper, we target a new family of feature-dependent label noise that is much more general than the commonly used i.i.d. label noise and encompasses a broad spectrum of noise patterns. Focusing on this general noise family, we propose a progressive label correction algorithm that iteratively corrects labels and refines the model. We provide theoretical guarantees showing that, for a wide variety of (unknown) noise patterns, a classifier trained with this strategy converges to be consistent with the Bayes classifier. In experiments, our method outperforms state-of-the-art (SOTA) baselines and is robust to various noise types and levels.
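The abstract describes the algorithm only as iteratively correcting labels and refining the model. The loop below is a minimal sketch of that general idea, not the paper's exact procedure. The flip rule (relabel a sample when the model confidently disagrees with its current label), the decreasing threshold schedule, the softmax-regression model, and all function and parameter names are assumptions chosen for illustration.

```python
import numpy as np


def predict_proba(X, W, b):
    """Class probabilities of a softmax-regression model."""
    logits = X @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    expv = np.exp(logits)
    return expv / expv.sum(axis=1, keepdims=True)


def fit_softmax(X, y, n_classes, lr=0.1, epochs=200):
    """Fit softmax regression with plain full-batch gradient descent."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot targets from integer labels
    for _ in range(epochs):
        grad = (predict_proba(X, W, b) - Y) / n  # cross-entropy gradient w.r.t. logits
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b


def progressive_label_correction(X, y_noisy, n_classes,
                                 thresholds=(0.9, 0.8, 0.7, 0.6)):
    """Alternate between refining the model and correcting labels.

    Assumed rule (hypothetical, for illustration): a label is replaced by
    the model's prediction only when the predicted class differs from the
    current label and its probability is at least the current threshold.
    The threshold is lowered step by step, so the most confident
    corrections happen first.
    """
    y = np.asarray(y_noisy).copy()  # never modify the caller's labels
    for theta in thresholds:
        W, b = fit_softmax(X, y, n_classes)       # refine the model
        proba = predict_proba(X, W, b)
        pred = proba.argmax(axis=1)
        confident = proba.max(axis=1) >= theta
        flip = confident & (pred != y)
        y[flip] = pred[flip]                      # correct the labels
    W, b = fit_softmax(X, y, n_classes)           # final fit on corrected labels
    return y, (W, b)
```

Starting with a high threshold restricts early corrections to samples where the model is most confident, which limits the risk of a poorly trained early model propagating its own mistakes into the labels. Any classifier that outputs class probabilities could replace the softmax regression used here.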