Training even moderately sized generative models with differentially private stochastic gradient descent (DP-SGD) is difficult: the level of noise required for reasonable levels of privacy is simply too large. We advocate instead building off a good, relevant representation learned on an informative public dataset, then learning to model the private data with that representation. In particular, we minimize the maximum mean discrepancy (MMD) between private target data and a generator's distribution, using a kernel based on perceptual features learned from a public dataset. With the MMD, we can simply privatize the data-dependent term once and for all, rather than introducing noise at each step of optimization as in DP-SGD. Our algorithm allows us to generate CIFAR10-level images with $\epsilon \approx 2$ which capture distinctive features of the distribution, far surpassing the current state of the art, which mostly focuses on datasets such as MNIST and FashionMNIST at a large $\epsilon \approx 10$. Our work introduces simple yet powerful foundations for reducing the gap between private and non-private deep generative models.
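The following is a minimal sketch, not the authors' implementation, of the key idea: with a kernel $k(x, y) = \langle \phi(x), \phi(y) \rangle$ built from a feature map $\phi$, the data-dependent term of the MMD reduces to the mean embedding of the private data, which can be released once via the Gaussian mechanism and reused for every generator update. The feature map here is a random stand-in (the paper uses perceptual features learned on public data), and names such as `feature_extractor`, `noise_std`, and the generator architecture are illustrative assumptions.

```python
# Sketch: privatize the data-dependent MMD term once, then train a generator
# against the noisy mean embedding with no per-step DP-SGD noise.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the perceptual feature map phi; in the paper this would be a
# network pretrained on an informative public dataset.
feature_dim = 128
feature_extractor = nn.Sequential(
    nn.Flatten(), nn.Linear(3 * 32 * 32, feature_dim), nn.Tanh()
)
for p in feature_extractor.parameters():
    p.requires_grad_(False)

def mean_embedding(x):
    # Normalize per-sample features so each sample's contribution has norm <= 1,
    # bounding the sensitivity of the mean embedding by 2 / n.
    f = feature_extractor(x)
    f = f / f.norm(dim=1, keepdim=True).clamp_min(1e-12)
    return f.mean(dim=0)

# Placeholder private data standing in for e.g. CIFAR10 images.
n = 1024
private_data = torch.rand(n, 3, 32, 32)

# Release the private mean embedding once with the Gaussian mechanism;
# noise_std would be calibrated from (epsilon, delta) and the sensitivity.
sensitivity = 2.0 / n
noise_std = 1.0 * sensitivity  # illustrative noise multiplier of 1.0
private_embedding = mean_embedding(private_data) + noise_std * torch.randn(feature_dim)

# Simple generator; training only ever touches the privatized embedding.
generator = nn.Sequential(
    nn.Linear(64, 512), nn.ReLU(), nn.Linear(512, 3 * 32 * 32), nn.Sigmoid()
)
opt = torch.optim.Adam(generator.parameters(), lr=1e-3)

for step in range(200):
    z = torch.randn(256, 64)
    fake = generator(z).view(-1, 3, 32, 32)
    # With a linear-in-features kernel, the (squared) MMD is the squared
    # distance between the mean embeddings of fake and (noisy) private data.
    loss = (mean_embedding(fake) - private_embedding).pow(2).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```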