Machine learning (ML) models can leak information about users, and differential privacy (DP) provides a rigorous way to bound that leakage under a given budget. This DP budget can be regarded as a new type of compute resource in workloads where multiple ML models train on user data. Once used, the DP budget is forever consumed. It is therefore crucial to allocate it as efficiently as possible, so as to train as many models as possible. This paper presents a scheduler for privacy that optimizes for efficiency. We formulate privacy scheduling as a new type of multidimensional knapsack problem, called privacy knapsack, which maximizes DP budget efficiency. We show that privacy knapsack is NP-hard, hence practical algorithms are necessarily approximate. We develop an approximation algorithm for privacy knapsack, DPK, and evaluate it on microbenchmarks and on a new, synthetic private-ML workload we developed from the Alibaba ML cluster trace. We show that DPK: (1) often approaches the efficiency-optimal schedule, (2) consistently schedules more tasks compared to a state-of-the-art privacy scheduling algorithm that focused on fairness (1.3-1.7x in Alibaba, 1.0-2.6x in microbenchmarks), but (3) sacrifices some level of fairness for efficiency. Therefore, using DPK, DP ML operators should be able to train more models on the same amount of user data while offering the same privacy guarantee to their users.
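To build intuition for the privacy knapsack formulation, here is a minimal, hypothetical sketch of DP budget as a multidimensional capacity: each data block carries a finite privacy budget, each training task demands some epsilon from a subset of blocks, and the scheduler must pack tasks without exceeding any block's budget. The greedy profit-per-demand heuristic below is an illustrative baseline only, not the paper's DPK algorithm; all names and numbers are invented for the example.

```python
# Hypothetical sketch: privacy scheduling as a multidimensional knapsack.
# Each data block has a finite DP budget (capacity); each task demands
# some epsilon from a subset of blocks. A simple greedy heuristic
# (profit per unit of epsilon demanded) illustrates the packing
# problem; it is NOT the DPK algorithm from the paper.

def greedy_privacy_knapsack(capacities, tasks):
    """capacities: {block: epsilon_budget}.
    tasks: list of (name, profit, demands) where demands is
    {block: epsilon_requested}. Returns names of scheduled tasks."""
    remaining = dict(capacities)
    # Rank tasks by profit per unit of total epsilon demanded.
    ranked = sorted(tasks, key=lambda t: t[1] / sum(t[2].values()),
                    reverse=True)
    scheduled = []
    for name, profit, demands in ranked:
        # A task is admitted only if every block it touches still has
        # enough unspent budget.
        if all(remaining[b] >= eps for b, eps in demands.items()):
            for b, eps in demands.items():
                remaining[b] -= eps  # budget, once spent, is gone forever
            scheduled.append(name)
    return scheduled

caps = {"block0": 1.0, "block1": 1.0}
tasks = [
    ("A", 3.0, {"block0": 0.5}),
    ("B", 2.0, {"block0": 0.6, "block1": 0.6}),
    ("C", 1.0, {"block1": 0.4}),
]
print(greedy_privacy_knapsack(caps, tasks))  # → ['A', 'C']
```

Because the underlying problem is NP-hard, any such polynomial-time heuristic can be arbitrarily far from optimal on adversarial inputs; the paper's contribution is an approximation algorithm with much better packing behavior in practice.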