We present a simple algorithm for identifying and correcting noisy real-valued labels in a mixture of clean and corrupted samples using Gaussian process regression (GPR). A heteroscedastic noise model is employed, in which each observed label is assigned an additive Gaussian noise term with its own independent variance. The method thus effectively applies a sample-specific Tikhonov regularization term, generalizing the uniform regularization prevalent in standard Gaussian process regression. Optimizing the noise model by maximum likelihood estimation leads to the GPR model's predictive error in leave-one-out cross-validation being contained by the posterior standard deviation. A multiplicative update scheme is proposed for solving the maximum likelihood estimation problem under non-negativity constraints. While we provide a proof of monotonic convergence for certain special cases, the multiplicative scheme has empirically demonstrated monotonic convergence in virtually all of our numerical experiments. We show that the presented method can pinpoint corrupted samples and lead to better regression models when trained on synthetic and real-world scientific data sets.
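To make the sample-specific regularization concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a squared-exponential kernel, a toy sine data set with one corrupted label, and hand-picked noise variances. The maximum likelihood estimation and the multiplicative update scheme are not reproduced here, so the inflated variance for the corrupted sample is set manually for illustration. The leave-one-out quantities use the standard closed-form identities for Gaussian conditionals, and all function and variable names are hypothetical.

```python
import numpy as np

def rbf_kernel(xa, xb, length_scale=1.0):
    """Squared-exponential kernel between two 1-D input arrays (assumed kernel)."""
    d = xa[:, None] - xb[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gpr_mean(x_train, y_train, x_test, noise_var):
    """GP posterior mean with one noise variance per training sample."""
    # K + diag(noise_var): a sample-specific Tikhonov term. Standard GPR is
    # the special case where every entry of noise_var is the same.
    K_reg = rbf_kernel(x_train, x_train) + np.diag(noise_var)
    alpha = np.linalg.solve(K_reg, y_train)
    return rbf_kernel(x_test, x_train) @ alpha

def loo_error_and_std(x_train, y_train, noise_var):
    """Closed-form leave-one-out residuals and predictive standard deviations."""
    K_inv = np.linalg.inv(rbf_kernel(x_train, x_train) + np.diag(noise_var))
    diag = np.diag(K_inv)
    loo_err = (K_inv @ y_train) / diag   # y_i minus the LOO predictive mean
    loo_std = np.sqrt(1.0 / diag)        # includes the noise of sample i
    return loo_err, loo_std

# Toy data: clean sine labels with a single corrupted sample (illustrative).
x = np.linspace(0.0, 2.0 * np.pi, 30)
y = np.sin(x)
bad = 10
y[bad] += 3.0

# Uniform regularization, as in standard GPR.
uniform = np.full(x.shape, 1e-2)
err_u, std_u = loo_error_and_std(x, y, uniform)
ratio_before = abs(err_u[bad]) / std_u[bad]

# Heteroscedastic regularization: larger variance on the suspect sample.
# The value 9.0 is chosen by hand here, not learned by maximum likelihood.
hetero = uniform.copy()
hetero[bad] = 9.0
err_h, std_h = loo_error_and_std(x, y, hetero)
ratio_after = abs(err_h[bad]) / std_h[bad]

# Effect on the fit at the corrupted location, measured against sin(x).
x_bad = x[bad:bad + 1]
fit_err_before = abs(gpr_mean(x, y, x_bad, uniform)[0] - np.sin(x[bad]))
fit_err_after = abs(gpr_mean(x, y, x_bad, hetero)[0] - np.sin(x[bad]))

print("LOO |error| / std at corrupted sample:", ratio_before, "->", ratio_after)
print("fit error at corrupted sample:", fit_err_before, "->", fit_err_after)
```

Under uniform regularization, the leave-one-out error at the corrupted sample is many standard deviations wide, which is one way such a sample can be flagged. Once that sample's variance is raised, the error falls within roughly one standard deviation and the fit is no longer pulled toward the corrupted label.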