ML models are typically trained on large, high-quality datasets. However, training datasets often contain inconsistent or incomplete data. One approach to this issue is to develop algorithms that can check whether a prediction of a model is certifiably robust. Given a learning algorithm that produces a classifier and an example at test time, a classification outcome is certifiably robust if it is predicted by every model trained across all possible worlds (repairs) of the uncertain (inconsistent) dataset. This notion of robustness falls naturally under the framework of certain answers. In this paper, we study the complexity of certifying robustness for a simple but widely deployed classification algorithm, $k$-Nearest Neighbors ($k$-NN). Our main focus is on inconsistent datasets whose integrity constraints are functional dependencies (FDs). For this setting, we establish a dichotomy in the complexity of certifying robustness w.r.t. the set of FDs: the problem either admits a polynomial-time algorithm, or it is coNP-hard. Additionally, we exhibit a similar dichotomy for the counting version of the problem, where the goal is to count the number of possible worlds that predict a certain label. As a byproduct of our study, we also establish the complexity of a problem related to finding an optimal subset repair, which may be of independent interest.
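To make the notion of certifiable robustness concrete, the following is a minimal brute-force sketch for the $k = 1$ case: inconsistent tuples (those agreeing on a key FD's left-hand side) are grouped into blocks, a repair keeps exactly one tuple per block, and the prediction is robust only if every repair yields the same nearest-neighbor label. The names `blocks`, `nn_label`, and `certify_robust` are illustrative, not from the paper, and this exhaustive enumeration is exponential in the number of conflicting blocks — the paper's dichotomy concerns when this can be avoided.

```python
from itertools import product

def nn_label(points, x):
    """Label of the 1-nearest neighbor of x under squared Euclidean distance."""
    return min(points, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))[1]

def certify_robust(blocks, x):
    """Return the common predicted label if every repair agrees, else None.

    blocks: list of lists of (feature_vector, label) pairs; each inner list
    holds the mutually inconsistent tuples sharing one key value, so a
    repair is one choice per block (enumerated via itertools.product).
    """
    labels = {nn_label(repair, x) for repair in product(*blocks)}
    return labels.pop() if len(labels) == 1 else None

# Example: one conflicting block (two versions of a tuple) plus one clean tuple.
blocks = [
    [((0.0,), "A"), ((5.0,), "B")],  # two conflicting versions of one tuple
    [((1.0,), "A")],                 # a consistent tuple: a single repair choice
]
print(certify_robust(blocks, (0.5,)))  # both repairs predict "A" -> robust
```

For a test point near the conflicting tuple's two versions, e.g. `(4.5,)`, the two repairs disagree ("A" vs. "B") and the function returns `None`, i.e., the prediction is not certifiably robust.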