In a binary classification problem where the goal is to fit an accurate predictor, corrupted labels in the training data set may create an additional challenge. However, in settings where likelihood maximization is poorly behaved (for example, if positive and negative labels are perfectly separable), a small fraction of corrupted labels can improve performance by ensuring robustness. In this work, we establish that in such settings, corruption acts as a form of regularization, and we compute precise upper bounds on estimation error in the presence of corruptions. Our results suggest that corrupted data points are beneficial only up to a small fraction of the total sample, scaling with the square root of the sample size.
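The separable case mentioned above can be illustrated with a toy experiment. The following is a minimal sketch, not the method of this work: it assumes a one-dimensional logistic model without intercept, labels in {-1, +1}, and plain gradient ascent on the log-likelihood. The data, the helper name `fit_logistic_1d`, the step size, and the iteration counts are all hypothetical choices made for illustration.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic_1d(xs, ys, steps, lr=0.1):
    """Gradient ascent on the logistic log-likelihood (hypothetical helper).

    Model: P(y | x) = sigmoid(y * w * x), with y in {-1, +1} and no intercept.
    """
    w = 0.0
    for _ in range(steps):
        # Derivative of sum_i log sigmoid(y_i * w * x_i) with respect to w.
        grad = sum(y * x * sigmoid(-y * x * w) for x, y in zip(xs, ys))
        w += lr * grad
    return w


# Toy data (an assumption for illustration): the clean labels are
# perfectly separable by the sign of x.
xs = [-2.0, -1.0, 1.0, 2.0]
clean = [-1, -1, 1, 1]
corrupted = [-1, -1, -1, 1]  # the label at x = 1.0 has been flipped

for steps in (500, 5000):
    w_clean = fit_logistic_1d(xs, clean, steps)
    w_corrupted = fit_logistic_1d(xs, corrupted, steps)
    print(f"steps={steps}  clean w={w_clean:.3f}  corrupted w={w_corrupted:.3f}")
```

On the clean labels the fitted weight keeps growing as the number of steps increases, because the likelihood has no finite maximizer under perfect separation. With the single flipped label the weight settles at a finite value, which is the sense in which a corrupted label behaves like a regularizer in this toy setting.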