Not all data in a typical training set help with generalization; some samples can be overly ambiguous or outright mislabeled. This paper introduces a new method to identify such samples and mitigate their impact when training neural networks. At the heart of our algorithm is the Area Under the Margin (AUM) statistic, which exploits differences in the training dynamics of clean and mislabeled samples. A simple procedure, adding an extra class populated with purposefully mislabeled threshold samples, learns an AUM upper bound that isolates mislabeled data. This approach consistently improves upon prior work on synthetic and real-world datasets. On the WebVision50 classification task our method removes 17% of training data, yielding a 1.6% (absolute) improvement in test error. On CIFAR100 removing 13% of the data leads to a 1.2% drop in error.
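To make the AUM statistic concrete, the sketch below computes it for one sample under a common reading of the abstract: the margin at each epoch is the assigned-class logit minus the largest other-class logit, and AUM averages this margin over epochs. All names and shapes here (`margin`, `area_under_margin`, `logits_per_epoch`) are illustrative assumptions, not the paper's reference implementation.

```python
# Minimal sketch of the Area Under the Margin (AUM) statistic, assuming
# logits for each sample are recorded once per training epoch.
import numpy as np

def margin(logits: np.ndarray, label: int) -> float:
    """Assigned-class logit minus the largest non-assigned logit."""
    assigned = logits[label]
    others = np.delete(logits, label)
    return float(assigned - others.max())

def area_under_margin(logits_per_epoch: np.ndarray, label: int) -> float:
    """Average one sample's margin across training epochs.

    logits_per_epoch: shape (num_epochs, num_classes), the model's logits
    for this sample recorded at each epoch.
    """
    return float(np.mean([margin(l, label) for l in logits_per_epoch]))

# Toy usage: mislabeled samples tend to accumulate low (often negative)
# AUM, since a competing class logit dominates the assigned one for many
# epochs, while clean samples accumulate a high positive AUM.
rng = np.random.default_rng(0)
recorded_logits = rng.normal(size=(10, 5))  # 10 epochs, 5 classes
print(area_under_margin(recorded_logits, label=2))
```

In the paper's procedure as summarized above, the threshold samples (real inputs deliberately assigned to an extra, fake class) are by construction mislabeled, so their AUM values provide a data-driven upper bound: genuine samples scoring below that bound are flagged as likely mislabeled.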