As data becomes a vital driver of technological and economic advancement, a key challenge is accurately quantifying its value in algorithmic decision-making. The Shapley value, a well-established concept from cooperative game theory, has been widely adopted to assess the contribution of individual data sources in supervised machine learning. However, its symmetry axiom assumes that all players in the cooperative game are homogeneous, which overlooks the complex structures and dependencies present in real-world datasets. To address this limitation, we extend the traditional data Shapley framework to asymmetric data Shapley, making it flexible enough to incorporate the inherent structure of a dataset for structure-aware data valuation. We also introduce an efficient $k$-nearest neighbor-based algorithm for its exact computation. We demonstrate the practical applicability of our framework across various machine learning tasks and data market contexts. The code is available at: https://github.com/xzheng01/Asymmetric-Data-Shapley.