The presence of mislabeled observations in training data is a notoriously challenging problem in statistics and machine learning, associated with poor generalization for traditional classifiers and, perhaps even more so, for flexible classifiers such as neural networks. Here we propose a novel double regularization of the neural network training loss that combines a penalty on the complexity of the classification model with an optimal reweighting of training observations. The combined penalties yield improved generalization and strong robustness against overfitting, both across different settings of mislabeled training data and against variation in the initial parameter values used in training. We provide a theoretical justification for the proposed method, derived for the simple case of logistic regression. We demonstrate the double regularization model, denoted DRFit, for neural network classification of (i) MNIST and (ii) CIFAR-10, in both cases with simulated mislabeling. We also show that DRFit identifies mislabeled data points with high precision. This provides strong support for DRFit as a practical off-the-shelf classifier: without any sacrifice in performance, it simultaneously reduces overfitting to mislabeled data and gives an accurate measure of the trustworthiness of the labels.
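To make the idea of a doubly regularized loss concrete, the sketch below shows one way such an objective could look in PyTorch: a per-example data term scaled by trainable observation weights, an L2 complexity penalty on the model parameters, and a penalty keeping the observation weights near uniform. This is an illustrative assumption, not the paper's exact formulation; the function name `drfit_loss`, the hyperparameters `lam` and `gamma`, and the specific choice of weight penalty are all hypothetical.

```python
# Minimal sketch of a doubly regularized training loss, assuming a
# weighted cross-entropy data term, an L2 complexity penalty, and a
# quadratic penalty on the observation weights. All names and penalty
# choices here are illustrative, not taken from the paper.
import torch
import torch.nn.functional as F

def drfit_loss(model, x, y, log_w, lam=1e-4, gamma=1e-2):
    """Doubly regularized loss over one batch.

    log_w : trainable per-example log-weights (one entry per training
            observation in the batch); suspected mislabeled points can
            be down-weighted by driving their weight toward zero.
    lam   : strength of the L2 penalty on the model parameters.
    gamma : strength of the penalty pulling weights toward uniform.
    """
    w = torch.exp(log_w)                               # positive weights
    per_example = F.cross_entropy(model(x), y, reduction="none")
    data_term = (w * per_example).mean()

    # Complexity penalty on the classifier itself.
    l2 = sum((p ** 2).sum() for p in model.parameters())

    # Penalty discouraging degenerate weights (all mass on few points).
    weight_term = ((w - 1.0) ** 2).mean()

    return data_term + lam * l2 + gamma * weight_term
```

In this sketch, the model parameters and `log_w` would be optimized jointly (e.g., passed together to a single optimizer); after training, small weights `exp(log_w)` flag the observations the procedure treats as likely mislabeled.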