One of the most effective approaches to improving the performance of a machine-learning model is to acquire additional training data. To do so, a model owner may seek to acquire relevant training data from a data owner. Before procuring the data, the model owner needs to appraise the data. However, the data owner generally does not want to share the data until after an agreement is reached. The resulting Catch-22 prevents efficient data markets from forming. To address this problem, we develop data appraisal methods that do not require data sharing by using secure multi-party computation. Specifically, we study methods that: (1) compute parameter gradient norms, (2) perform model fine-tuning, and (3) compute influence functions. Our experiments show that influence functions provide an appealing trade-off between high-quality appraisal and required computation.
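To make the third method concrete: a first-order influence function estimates how upweighting one candidate training point would change the model's loss on a held-out validation set, without retraining. The sketch below is a minimal illustration on ridge-regularized linear regression (chosen because its Hessian is available in closed form); it is not the paper's secure multi-party protocol, and all names (`influence`, `H`, the synthetic data) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3

# Synthetic "model owner" training data and a held-out validation split.
X = rng.normal(size=(50, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)
X_val = rng.normal(size=(20, d))
y_val = X_val @ w_true + 0.1 * rng.normal(size=20)

# Fit ridge regression; H is the (regularized) Hessian of the training loss.
lam = 1e-3
H = X.T @ X + lam * np.eye(d)
w = np.linalg.solve(H, X.T @ y)

def influence(x_cand, y_cand):
    """Predicted change in validation loss from upweighting one candidate
    point (x_cand, y_cand): -grad_val(w)^T H^{-1} grad_cand(w).
    Negative values suggest the point would help the model."""
    g_cand = (x_cand @ w - y_cand) * x_cand        # gradient of candidate loss
    g_val = X_val.T @ (X_val @ w - y_val)           # gradient of validation loss
    return float(-g_val @ np.linalg.solve(H, g_cand))
```

In a data-appraisal setting, a candidate point whose label is badly wrong produces a much larger candidate gradient, and hence a much larger-magnitude influence score, than the same point with a consistent label; this is the signal that makes influence an appraisal criterion.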