Developing modern machine learning (ML) applications is data-centric, and one fundamental challenge is understanding how data quality influences ML training: "Which training examples are 'guilty' of making the trained ML model's predictions inaccurate or unfair?" Modeling data influence for ML training has attracted intensive interest over the last decade, and one popular framework is to compute the Shapley value of each training example with respect to utilities such as the validation accuracy and fairness of the trained ML model. Unfortunately, despite this interest and research, existing methods consider only a single ML model "in isolation" and do not consider an end-to-end ML pipeline that consists of data transformations, feature extractors, and ML training. We present DataScope (ease.ml/datascope), the first system that efficiently computes Shapley values of training examples over an end-to-end ML pipeline, and illustrate its applications in data debugging for ML training. To this end, we first develop a novel algorithmic framework that computes Shapley values over a specific family of ML pipelines that we call canonical pipelines: a positive relational algebra query followed by a K-nearest-neighbor (KNN) classifier. We show that, for many subfamilies of canonical pipelines, computing Shapley values is in PTIME, in contrast to the exponential complexity of computing Shapley values in general. We then put this into practice: given an sklearn pipeline, we approximate it with a canonical pipeline to use as a proxy. We conduct extensive experiments illustrating different use cases and utilities. Our results show that DataScope is up to four orders of magnitude faster than state-of-the-art Monte Carlo-based methods, while being comparably, and often even more, effective in data debugging.
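To make the notion of a per-example Shapley value concrete, the following minimal sketch computes exact Shapley values by brute-force subset enumeration for a toy 1-D training set, using the validation accuracy of a KNN classifier as the utility. It is an illustration only and not DataScope's algorithm: the enumeration is exponential in the number of training examples, which is the cost the PTIME results above avoid. The function names (`knn_accuracy`, `exact_shapley`), the toy data, and the convention that an empty training set has utility 0 are all assumptions made for this example.

```python
from itertools import combinations
from math import factorial


def knn_accuracy(train, val, k=1):
    """Validation accuracy of a K-NN classifier on 1-D points.

    Assumption for this sketch: an empty training set has utility 0.
    """
    if not train:
        return 0.0
    correct = 0
    for x, y in val:
        # The k training points closest to the validation point x.
        neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
        votes = [label for _, label in neighbors]
        # Majority vote; ties are broken toward the smaller label.
        pred = max(sorted(set(votes)), key=votes.count)
        correct += pred == y
    return correct / len(val)


def exact_shapley(train, val, utility=knn_accuracy):
    """Exact Shapley value of every training example (exponential time)."""
    n = len(train)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without = [train[j] for j in subset]
                with_i = without + [train[i]]
                # Marginal contribution of example i to this coalition.
                values[i] += weight * (utility(with_i, val) - utility(without, val))
    return values


# Hypothetical toy data: (feature, label). The third point is deliberately mislabeled.
train = [(0.0, 0), (1.0, 0), (0.9, 1), (3.0, 1)]
val = [(0.1, 0), (0.8, 0), (2.9, 1)]

values = exact_shapley(train, val)
for example, value in zip(train, values):
    print(example, round(value, 3))
```

In this toy setting the mislabeled third example receives the lowest, negative value, which is the kind of signal a data-debugging workflow would use to flag it for inspection. The values also sum to the utility of the full training set minus that of the empty set, as the Shapley efficiency property requires.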