Detecting Label Noise via Leave-One-Out Cross-Validation

We present a simple algorithm for identifying and correcting noisy real-valued labels in a mixture of clean and corrupted sample points using Gaussian process regression (GPR). We employ a heteroscedastic noise model in which each observed label is assigned an additive Gaussian noise term with its own independent variance. Optimizing the noise model by maximum likelihood estimation causes the GPR model's leave-one-out cross-validation predictive error to be contained by the posterior standard deviation. We propose a multiplicative update scheme for solving the maximum likelihood estimation problem under non-negativity constraints. Although we prove convergence only for certain special cases, the multiplicative scheme converged monotonically in virtually all of our numerical experiments. We show that the method can pinpoint corrupted sample points and yields better regression models when trained on synthetic and real-world scientific data sets.
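
To make the leave-one-out quantities concrete, here is a minimal sketch, not the authors' implementation. It uses the standard closed-form leave-one-out identities for GP regression with a per-label noise variance: with K_y = K + diag(noise_var), the leave-one-out mean is y_i - [K_y^{-1} y]_i / [K_y^{-1}]_ii and the leave-one-out variance is 1 / [K_y^{-1}]_ii. The RBF kernel, its length scale, the fixed noise variances, the toy data, and the function names `rbf_kernel` and `loo_predictions` are all assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(x, length_scale=0.2):
    """Squared-exponential kernel matrix for 1-D inputs (assumed kernel choice)."""
    diff = x[:, None] - x[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def loo_predictions(K, y, noise_var):
    """Closed-form leave-one-out predictive mean and std for GP regression.

    noise_var holds one noise variance per label (heteroscedastic model).
    The returned std is for the noisy label y_i, so it includes noise_var[i].
    """
    Ky_inv = np.linalg.inv(K + np.diag(noise_var))  # explicit inverse is fine for a small sketch
    alpha = Ky_inv @ y
    diag = np.diag(Ky_inv)
    loo_mean = y - alpha / diag
    loo_std = np.sqrt(1.0 / diag)
    return loo_mean, loo_std


# Toy data (hypothetical): a smooth signal with one corrupted label.
X = np.linspace(0.0, 1.0, 30)
y = np.sin(2.0 * np.pi * X)
y[10] += 3.0  # inject a single corrupted label

noise_var = np.full(X.shape, 1e-2)  # fixed, not optimized, in this sketch
loo_mean, loo_std = loo_predictions(rbf_kernel(X), y, noise_var)

# Standardized leave-one-out error: large values suggest a suspicious label.
z = np.abs(y - loo_mean) / loo_std
print("most suspicious index:", int(np.argmax(z)))
```

In this sketch the noise variances are held fixed, so a corrupted label shows up as a leave-one-out error far larger than its predictive standard deviation. The multiplicative update scheme that would instead enlarge the corresponding noise variance is not reproduced here.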
