Training even moderately-sized generative models with differentially-private stochastic gradient descent (DP-SGD) is difficult: the noise required for reasonable levels of privacy is simply too large. We advocate instead building off a good, relevant representation learned on an informative public dataset, then learning to model the private data with that representation. In particular, we minimize the maximum mean discrepancy (MMD) between the private target data and the generator's distribution, using a kernel based on perceptual features learned from a public dataset. With the MMD, we can simply privatize the data-dependent term once and for all, rather than introducing noise at each step of optimization as in DP-SGD. Our algorithm allows us to generate CIFAR10-level images with $\epsilon \approx 2$ which capture distinctive features in the distribution, far surpassing the current state of the art, which mostly focuses on datasets such as MNIST and FashionMNIST at a much larger $\epsilon \approx 10$. Our work introduces simple yet powerful foundations for reducing the gap between private and non-private deep generative models. Our code is available at \url{https://github.com/ParkLabML/DP-MEPF}.
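To make the privatize-once idea concrete, the following is a rough sketch rather than the paper's exact formulation; the symbols $\phi$, $g_\theta$, $x_i$, $z_j$, and $\sigma$ are illustrative notation. With a kernel $k(x, y) = \langle \phi(x), \phi(y) \rangle$ built from a perceptual feature map $\phi$ learned on public data, the empirical squared MMD between the private data $\{x_i\}_{i=1}^{n}$ and generator samples $g_\theta(z_j)$ reduces to a distance between feature means:
\[
\widehat{\mathrm{MMD}}^2 \;=\; \Big\| \frac{1}{n}\sum_{i=1}^{n} \phi(x_i) \;-\; \frac{1}{m}\sum_{j=1}^{m} \phi\big(g_\theta(z_j)\big) \Big\|^2 .
\]
The only data-dependent quantity here is the private feature mean $\hat{\mu}_P = \frac{1}{n}\sum_i \phi(x_i)$; if $\|\phi(\cdot)\|$ is bounded so that its sensitivity is controlled, it can be released once via, e.g., the Gaussian mechanism, $\tilde{\mu}_P = \hat{\mu}_P + \mathcal{N}(0, \sigma^2 I)$, and the generator can then be trained against $\tilde{\mu}_P$ at no further privacy cost.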