The Shapley value is a game-theoretic notion for wealth distribution that is nowadays extensively used to explain complex data-intensive computation, for instance, in network analysis or machine learning. Recent theoretical works show that query evaluation over relational databases fits well in this explanation paradigm. Yet, these works fall short of providing practical solutions to the computational challenge inherent to the Shapley computation. We present in this paper two practically effective solutions for computing Shapley values in query answering. We start by establishing a tight theoretical connection to the extensively studied problem of query evaluation over probabilistic databases, which allows us to obtain a polynomial-time algorithm for the class of queries for which probability computation is tractable. We then propose a first practical solution for computing Shapley values that adopts tools from probabilistic query evaluation. In particular, we capture the dependence of query answers on input database facts using Boolean expressions (data provenance), and then transform it, via Knowledge Compilation, into a particular circuit form for which we devise an algorithm for computing the Shapley values. Our second practical solution is a faster yet inexact approach that transforms the provenance to a Conjunctive Normal Form and uses a heuristic to compute the Shapley values. Our experiments on TPC-H and IMDB demonstrate the practical effectiveness of our solutions.
翻译:Shapley 值是财富分配的游戏理论概念,目前广泛用于解释复杂的数据密集计算,例如网络分析或机器学习。最近的理论工作表明,对关系数据库的查询评估非常适合这一解释模式。然而,这些工作没有为Shapley 计算所固有的计算挑战提供切实可行的解决方案。我们在本文件中提出了两种在问答中计算损耗值的实际有效的解决方案。我们首先在理论上与广泛研究的对概率数据库的查询评估问题建立紧密的联系,从而使我们能够为概率计算可以牵动的查询类别获得一个多数字时间算法。我们随后提出了第一个实用的解决方案,用于计算利用概率性查询评估工具的Shalpley 值。特别是,我们用Boolean 表达法(数据导出)记录对输入数据库事实的查询答案的依赖性,然后通过知识汇编将其转换成一种特定的电路形式,我们为此设计一种计算损耗值的算法。我们的第二个实际解决方案是更快但又非常快速的解算法,将我们的实际实验法转换成一个软件格式。