We investigate problems in penalized $M$-estimation, inspired by applications in machine learning debugging. Data are collected from two pools: one containing data with possibly contaminated labels, and the other known to contain only cleanly labeled points. We first formulate a general statistical algorithm for identifying buggy points and provide rigorous theoretical guarantees under the assumption that the data follow a linear model. We then present two case studies that illustrate the results of our general theory and the dependence of our estimator on clean versus buggy points. We further propose an algorithm for tuning parameter selection in our Lasso-based algorithm and provide corresponding theoretical guarantees. Finally, we consider a two-person "game" played between a bug generator and a debugger, where the debugger can augment the contaminated data set with cleanly labeled versions of points in the original data pool. We establish a theoretical result giving a sufficient condition under which the bug generator can always fool the debugger. Nonetheless, we provide empirical results showing that such a situation may not arise in practice, making it possible for natural augmentation strategies combined with our Lasso debugging algorithm to succeed.
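To make the Lasso-based debugging idea concrete, the following is a minimal sketch of one standard way to instantiate it: model the contaminated pool as $y = X\beta + \gamma + \varepsilon$, where each point carries its own mean-shift parameter $\gamma_i$, and penalize $\gamma$ with an $\ell_1$ norm so that only buggy points receive nonzero shifts. The alternating-minimization scheme, the function names, and the tuning parameter `lam` below are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def soft_threshold(z, lam):
    """Elementwise soft-thresholding, the proximal map of the L1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_debug(X, y, lam, n_iter=100):
    """Alternately fit beta by least squares on the shift-corrected labels,
    then update the per-point shifts gamma by soft-thresholding the residuals.
    Nonzero entries of gamma flag suspect (buggy) labels."""
    n, p = X.shape
    gamma = np.zeros(n)
    for _ in range(n_iter):
        beta, *_ = np.linalg.lstsq(X, y - gamma, rcond=None)
        gamma = soft_threshold(y - X @ beta, lam)
    return beta, gamma

# Demo on synthetic linear-model data: corrupt a few labels, then flag them.
rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = X @ beta_true + 0.1 * rng.normal(size=n)
buggy = [3, 17, 42]
y[buggy] += 5.0  # contaminate three labels with a large shift
beta_hat, gamma_hat = lasso_debug(X, y, lam=1.0)
flagged = np.flatnonzero(np.abs(gamma_hat) > 1e-8)
```

In this sketch a larger `lam` flags fewer points; choosing it in a data-driven way is exactly the tuning-parameter-selection problem the abstract refers to.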