We revisit the problem of using public data to improve the privacy/utility trade-offs of differentially private (DP) model training. Here, public data refers to auxiliary datasets that carry no privacy concerns. We consider public data drawn from the same distribution as the private training data. For convex losses, we show that a variant of Mirror Descent provides population risk guarantees that are independent of the dimension of the model ($p$). Specifically, we apply Mirror Descent with the loss generated by the public data as the mirror map, and use DP gradients of the loss generated by the private (sensitive) data. To obtain dimension independence, we require $G_Q^2 \leq p$ public data samples, where $G_Q$ is a measure of the isotropy of the loss function. We further show that our algorithm has a natural ``noise stability'' property: if the public loss is $\alpha_v$-strongly convex in a direction $v$ around the current iterate, then using noisy gradients instead of exact gradients shifts the next iterate in the direction $v$ by an amount proportional to $1/\alpha_v$ (in contrast with DP-SGD, where the shift is isotropic). Analogous results in prior works had to explicitly learn the geometry from the public data in the form of preconditioner matrices. Our method is also applicable to non-convex losses, as it does not rely on convexity assumptions to ensure DP guarantees. We demonstrate the empirical efficacy of our algorithm by showing privacy/utility trade-offs on linear regression, deep learning benchmark datasets (WikiText-2, CIFAR-10, and EMNIST), and federated learning (StackOverflow). We show that our algorithm not only significantly improves over traditional DP-SGD and DP-FedAvg, which have no access to public data, but also improves over DP-SGD and DP-FedAvg on models that have been pre-trained with the public data to begin with.
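The update described in the abstract (mirror map given by the public loss, DP gradients computed on the private data) can be sketched for the special case of linear regression, where the public loss is quadratic and the mirror step has a closed form. This is a minimal illustrative sketch, not the paper's exact algorithm: the synthetic data, the small ridge term that keeps the mirror map strongly convex, and all hyperparameter values below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_priv, n_pub = 20, 500, 200

# Synthetic regression data; public and private splits share a distribution.
w_true = rng.normal(size=p)
X_priv = rng.normal(size=(n_priv, p))
y_priv = X_priv @ w_true + 0.1 * rng.normal(size=n_priv)
X_pub = rng.normal(size=(n_pub, p))
y_pub = X_pub @ w_true + 0.1 * rng.normal(size=n_pub)

# Mirror map psi(w) = public least-squares loss. For a quadratic loss,
# grad psi(w) = H w - b, which is cheap to invert. The small ridge term
# (an assumption of this sketch) keeps psi strongly convex.
H = X_pub.T @ X_pub / n_pub + 1e-3 * np.eye(p)
b = X_pub.T @ y_pub / n_pub

clip, sigma, eta, T = 1.0, 0.5, 0.5, 200  # illustrative hyperparameters
w = np.zeros(p)
for _ in range(T):
    # Per-example private gradients, clipped and noised (Gaussian mechanism),
    # exactly as in DP-SGD; the privacy analysis does not use convexity.
    g = X_priv * (X_priv @ w - y_priv)[:, None]
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    g = g / np.maximum(1.0, norms / clip)
    noisy_grad = g.mean(axis=0) + (sigma * clip / n_priv) * rng.normal(size=p)

    # Mirror step: grad psi(w_next) = grad psi(w) - eta * noisy_grad.
    dual = (H @ w - b) - eta * noisy_grad
    w = np.linalg.solve(H, dual + b)  # invert grad psi for the quadratic map
```

Because the mirror step multiplies the noisy gradient by $H^{-1}$, directions in which the public loss is strongly curved absorb less of the injected noise, which is the ``noise stability'' property described above; a preconditioner-based method would have to estimate $H$ explicitly to get the same effect.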