We study the problem of valuing a data owner's (seller's) data for a data seeker (buyer). Data valuation is often carried out for a specific task and assumes a particular utility metric, such as test accuracy on a validation set, which may not be available in practice. In this work, we focus on task-agnostic data valuation without any validation requirements. The data buyer has access to a limited amount of data (which could be publicly available) and seeks more data samples from a data seller. We formulate the problem as estimating the differences between the statistical properties of the seller's data and those of the baseline data available to the buyer. We capture these statistical differences through second moments by measuring the diversity and relevance of the seller's data for the buyer; we estimate these measures through queries to the seller without requesting raw data. We design the queries so that the seller is blind to the buyer's raw data and lacks the knowledge needed to fabricate query responses that would yield a desired diversity-relevance trade-off. We show through extensive experiments on real tabular and image datasets that the proposed estimates capture the diversity and relevance of the seller's data for the buyer.
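To make the query-based idea concrete, the following is a minimal sketch of one way a second-moment comparison could be carried out without exchanging raw data. It is an illustration under stated assumptions, not the estimator proposed in this work: the function names (`buyer_queries`, `seller_response`, `diversity_relevance`), the choice of principal directions as queries, and the two ratio-based scores are all hypothetical. It also does not model the safeguards against fabricated responses mentioned above.

```python
import numpy as np

def buyer_queries(buyer_data, k):
    """Hypothetical query design: top-k principal directions of the
    buyer's covariance, plus the buyer's variance along each one."""
    centered = buyer_data - buyer_data.mean(axis=0)
    cov = centered.T @ centered / len(buyer_data)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]       # indices of the k largest
    return eigvecs[:, order], eigvals[order]

def seller_response(seller_data, directions):
    """The seller reveals only the variance of its own data along each
    queried direction, never the raw samples."""
    centered = seller_data - seller_data.mean(axis=0)
    return (centered @ directions).var(axis=0)

def diversity_relevance(buyer_var, seller_var, eps=1e-12):
    """Assumed scores in [0, 1]: geometric means over directions of
    min/max (relevance) and |difference|/max (diversity)."""
    hi = np.maximum(buyer_var, seller_var) + eps
    lo = np.minimum(buyer_var, seller_var)
    relevance = np.exp(np.mean(np.log(lo / hi + eps)))
    diversity = np.exp(np.mean(np.log(np.abs(buyer_var - seller_var) / hi + eps)))
    return diversity, relevance

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    buyer = rng.normal(size=(200, 5))
    seller = rng.normal(size=(300, 5)) * np.array([1.0, 2.0, 0.5, 1.0, 3.0])
    directions, buyer_var = buyer_queries(buyer, k=3)
    seller_var = seller_response(seller, directions)
    print(diversity_relevance(buyer_var, seller_var))
```

In this sketch, a seller whose data matches the buyer's variances along the queried directions scores high on relevance and low on diversity, while a seller whose variances differ strongly scores the other way around.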