Differentially private stochastic gradient descent (DP-SGD) privatizes model training by injecting noise into each iteration, where the noise magnitude increases with the number of model parameters. Recent works suggest that this noise can be reduced by leveraging public data for private machine learning: gradients are projected onto a low-dimensional subspace prescribed by the public data. However, given a choice of public datasets, it is not a priori clear which one is most appropriate for the private task. We give an algorithm for selecting a public dataset by measuring a low-dimensional subspace distance between the gradients of the public and private examples. We provide theoretical analysis showing that the excess risk scales with this subspace distance. The distance is easy to compute and robust to modifications of the setting. Empirical evaluation shows that trained-model accuracy is monotone in this distance.