Learning from noisy labels plays an important role in the deep learning era. Despite numerous studies with promising results, identifying clean labels in a noisily annotated dataset remains challenging because the conventional noisy label learning problem, with a single noisy label per instance, is not identifiable: it has no theoretically unique solution unless one has access to clean labels or introduces additional assumptions. This paper formally investigates this identifiability issue by formulating noisy label learning as a multinomial mixture model, which makes it possible to state the identifiability constraint explicitly. In particular, we prove that the noisy label learning problem is identifiable if at least $2C - 1$ noisy labels are provided per instance, where $C$ is the number of classes. In light of this requirement, we propose a method that automatically generates additional noisy labels for each training sample by estimating the noisy label distribution from its nearest neighbours. These additional noisy labels allow us to apply the Expectation-Maximisation algorithm to estimate the posterior of clean labels. We empirically demonstrate that the proposed method not only estimates clean labels without any heuristics on several challenging label noise benchmarks, including synthetic, web-controlled and real-world label noise, but also performs competitively with many state-of-the-art methods.
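To make the two ingredients named in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that extra noisy labels can be approximated by counting the noisy labels of each sample's k nearest neighbours in some feature space, and that a single multinomial mixture with one component per clean class is shared by all samples. The function names `neighbour_label_counts` and `em_multinomial_mixture`, the Euclidean distance, and the choice k = 2C - 1 are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np


def neighbour_label_counts(features, noisy_labels, n_classes, k):
    """Count noisy labels among each sample's k nearest neighbours.

    The sample itself is normally among its own nearest neighbours.
    Returns an array of shape (n_samples, n_classes) whose rows sum to k.
    """
    features = np.asarray(features, dtype=float)
    noisy_labels = np.asarray(noisy_labels)
    # Pairwise squared Euclidean distances, shape (n_samples, n_samples).
    sq_norms = (features ** 2).sum(axis=1)
    dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    nearest = np.argsort(dists, axis=1)[:, :k]
    counts = np.zeros((len(features), n_classes))
    for c in range(n_classes):
        counts[:, c] = (noisy_labels[nearest] == c).sum(axis=1)
    return counts


def em_multinomial_mixture(counts, n_components, n_iters=200, seed=0, eps=1e-12):
    """Fit a mixture of multinomials to the rows of `counts` with EM.

    Returns mixing weights (K,), component probabilities (K, n_classes)
    and per-sample responsibilities (n_samples, K).
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    n_classes = counts.shape[1]
    weights = np.full(n_components, 1.0 / n_components)
    probs = rng.dirichlet(np.ones(n_classes), size=n_components)

    def e_step(weights, probs):
        # The multinomial coefficient is identical for every component,
        # so it cancels in the posterior and is omitted here.
        log_post = np.log(weights + eps) + counts @ np.log(probs + eps).T
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        return resp / resp.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        resp = e_step(weights, probs)
        # M-step: re-estimate mixing weights and component probabilities.
        weights = resp.mean(axis=0)
        probs = resp.T @ counts + eps
        probs /= probs.sum(axis=1, keepdims=True)

    return weights, probs, e_step(weights, probs)


if __name__ == "__main__":
    n_classes = 3
    k = 2 * n_classes - 1  # 2C - 1 noisy labels per instance
    rng = np.random.default_rng(1)
    clean = rng.integers(0, n_classes, size=300)
    # Toy features: one Gaussian blob per clean class.
    features = rng.normal(size=(300, 2)) + 4.0 * np.eye(n_classes)[clean][:, :2]
    # Flip roughly 30% of the labels uniformly at random.
    flip = rng.random(300) < 0.3
    noisy = np.where(flip, rng.integers(0, n_classes, size=300), clean)
    counts = neighbour_label_counts(features, noisy, n_classes, k)
    weights, probs, resp = em_multinomial_mixture(counts, n_classes)
    print(weights.round(3))
    print(probs.round(3))
```

In this toy setting the responsibilities play the role of a clean-label posterior, but only up to a permutation of the components: the mixture does not say which component corresponds to which class. A real pipeline would need a way to align components with classes and a learned feature space for the neighbour search, neither of which is specified here.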