Data trading is essential to accelerating the development of data-driven machine learning pipelines. The central problem in data trading is data valuation: estimating the utility of a seller's dataset with respect to a given buyer's machine learning task. Typically, data valuation requires one or more participants to share their raw datasets with others, which creates a risk of intellectual property (IP) violations. In this paper, we tackle the novel task of preemptively protecting the IP of datasets that need to be shared during data valuation. First, we identify and formalize two kinds of novel IP risks in visual datasets: data-item (image) IP and statistical (dataset) IP. Then, we propose a novel algorithm that converts the raw dataset into a sanitized version that resists IP violations while still allowing accurate data valuation. The key idea is to limit the transfer of information from the raw dataset to the sanitized dataset, thereby protecting against potential IP violations. Next, we analyze our method for the likely existence of a solution and for immunity against reconstruction attacks. Finally, we conduct extensive experiments on three computer vision datasets that demonstrate the advantages of our method over other baselines.