Data valuation is an essential task in a data marketplace: it aims to compensate data owners fairly for their contributions. The machine learning community increasingly recognizes that the Shapley value, a foundational profit-sharing scheme in cooperative game theory, has major potential for valuing data, because it uniquely satisfies basic properties for fair credit allocation and has been shown to identify data sources that are useful or harmful to model performance. However, calculating the Shapley value requires access to the original data sources. It remains an open question how to design a real-world data marketplace that takes advantage of Shapley value-based data pricing while protecting privacy and allowing fair payments. In this paper, we propose the first prototype of a data marketplace that values data sources based on the Shapley value in a privacy-preserving manner while ensuring fair payments. Our approach is enabled by a suite of innovations in both algorithm and system design. We first propose a Shapley value calculation algorithm that can be efficiently implemented via multiparty computation (MPC) circuits. The key idea is to learn a performance predictor that directly predicts the model performance corresponding to an input dataset without performing actual training. We then optimize the MPC circuit design based on the structure of the performance predictor. Finally, we incorporate fair payment into the MPC circuit to guarantee that the data the buyer pays for is exactly the data that has been valued. Our experimental results show that the proposed data valuation algorithm is as effective as the original, expensive one. Furthermore, the customized MPC protocol is efficient and scalable.
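To make the valuation step concrete, the following is a minimal plaintext sketch of an exact Shapley value computation in which a utility function stands in for the learned performance predictor. It is not the paper's algorithm and involves no MPC; the names `shapley_values`, `toy_predictor`, and `QUALITY` are hypothetical, and the additive toy predictor is an assumption chosen only so that the result is easy to check by hand.

```python
from itertools import combinations
from math import factorial


def shapley_values(n_sources, utility):
    """Exact Shapley values for n_sources data sources.

    `utility` maps a frozenset of source indices to a score, e.g. the
    model performance predicted for the union of those sources.
    """
    players = range(n_sources)
    values = [0.0] * n_sources
    for i in players:
        others = [j for j in players if j != i]
        for size in range(n_sources):
            # Shapley weight for coalitions of this size: |S|! (n-|S|-1)! / n!
            weight = (
                factorial(size)
                * factorial(n_sources - size - 1)
                / factorial(n_sources)
            )
            for coalition in combinations(others, size):
                subset = frozenset(coalition)
                # Marginal contribution of source i to this coalition.
                gain = utility(subset | {i}) - utility(subset)
                values[i] += weight * gain
    return values


# Hypothetical per-source quality scores (assumption, not from the paper).
QUALITY = [0.5, 0.3, -0.1]


def toy_predictor(subset):
    """Hypothetical stand-in for a performance predictor: no training is run."""
    return sum(QUALITY[j] for j in subset)


phi = shapley_values(3, toy_predictor)
```

Because the toy predictor is additive, each source's Shapley value equals its own quality score up to floating-point error, and the values sum to the predicted performance of the full dataset. The exact computation enumerates all coalitions, so its cost grows exponentially with the number of sources; replacing model training with a cheap predictor call, as the abstract describes, reduces the cost of each utility evaluation rather than the number of coalitions.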