Data cleaning is crucial but often laborious in most machine learning (ML) applications. However, task-agnostic data cleaning is sometimes unnecessary: certain inconsistencies in the dirty data may not affect the predictions of ML models on test points. A test point is certifiably robust for an ML classifier if the prediction remains the same regardless of which (among exponentially many) cleaned dataset the classifier is trained on. In this paper, we study certifiable robustness for the Naive Bayes classifier (NBC) on dirty datasets with missing values. We present (i) an algorithm, linear in the number of entries in the dataset, that decides whether a test point is certifiably robust for NBC; (ii) an algorithm that counts, for each label, the number of cleaned datasets on which NBC can be trained to predict that label; and (iii) an efficient optimal algorithm that poisons a clean dataset by inserting the minimum number of missing values such that a test point is no longer certifiably robust for NBC. We prove that (iv) poisoning a clean dataset so that multiple test points become certifiably non-robust is NP-hard for any dataset with at least three features. Our experiments demonstrate that our algorithms for the decision and data-poisoning problems achieve up to $19.5\times$ and $3.06\times$ speed-ups, respectively, over the baseline algorithms across different real-world datasets.