Data valuation, in particular quantifying the value of data in algorithmic prediction and decision-making, is a fundamental problem in data trading scenarios. The most widely used approach is to define the data Shapley and approximate it with the permutation sampling algorithm. However, permutation sampling has a large estimation variance, which hinders the development of data marketplaces. To address this, we propose a more robust data valuation method based on stratified sampling, named variance reduced data Shapley (VRDS for short). We theoretically show how to stratify and how many samples to take from each stratum, and we provide a sample complexity analysis of VRDS. Finally, we illustrate the effectiveness of VRDS on different types of datasets and in data removal applications.
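To make the contrast between the two estimators concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a hypothetical toy utility (the squared total weight of a coalition) in place of model performance, chosen because the exact Shapley value of point i is then known in closed form: w_i times the total weight. The stratified estimator stratifies by coalition size and, as a simplifying assumption, draws the same number of samples from every stratum; the abstract states that VRDS derives how many samples each stratum should receive, and that allocation is not reproduced here. All function and parameter names are illustrative.

```python
import random


def utility(coalition, weights):
    """Hypothetical toy utility: squared total weight of the coalition."""
    total = sum(weights[j] for j in coalition)
    return total * total


def shapley_permutation(weights, num_permutations, rng):
    """Permutation sampling: average marginal contributions over random orderings."""
    n = len(weights)
    values = [0.0] * n
    for _ in range(num_permutations):
        order = list(range(n))
        rng.shuffle(order)
        prefix = []
        previous = 0.0  # utility of the empty coalition
        for i in order:
            prefix.append(i)
            current = utility(prefix, weights)
            values[i] += current - previous
            previous = current
    return [v / num_permutations for v in values]


def shapley_stratified(weights, samples_per_stratum, rng):
    """Stratified sampling by coalition size, with equal allocation per stratum
    (an assumption of this sketch, not the allocation derived for VRDS)."""
    n = len(weights)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        stratum_means = []
        for size in range(n):  # coalition sizes 0 .. n-1
            total = 0.0
            for _ in range(samples_per_stratum):
                coalition = rng.sample(others, size)
                total += utility(coalition + [i], weights) - utility(coalition, weights)
            stratum_means.append(total / samples_per_stratum)
        # The Shapley value is the unweighted mean of the per-size expectations.
        values[i] = sum(stratum_means) / n
    return values


if __name__ == "__main__":
    weights = [1.0, 2.0, 3.0, 4.0]
    rng = random.Random(0)
    print("exact      :", [w * sum(weights) for w in weights])
    print("permutation:", shapley_permutation(weights, 1000, rng))
    print("stratified :", shapley_stratified(weights, 250, rng))
```

Both estimators are unbiased for this toy game. Stratifying by coalition size removes the between-size component of the variance, which is the source of variance that the abstract describes VRDS as targeting.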