Distributional data Shapley value (DShapley) has recently been proposed as a principled framework to quantify the contribution of individual data points in machine learning. DShapley develops the foundational game-theoretic concept of Shapley values into a statistical framework and can be applied to identify data points that are useful (or harmful) to a learning algorithm. Estimating DShapley is computationally expensive, however, and this can be a major challenge to using it in practice. Moreover, there has been little mathematical analysis of how this value depends on data characteristics. In this paper, we derive the first analytic expressions for DShapley for the canonical problems of linear regression, binary classification, and non-parametric density estimation. These analytic forms provide new algorithms to estimate DShapley that are several orders of magnitude faster than previous state-of-the-art methods. Furthermore, our formulas are directly interpretable and provide quantitative insights into how the value varies for different types of data. We demonstrate the practical efficacy of our approach on multiple real and synthetic datasets.