Generative machine learning models are increasingly viewed as a way to share sensitive data between institutions. While there has been work on developing differentially private generative modeling approaches, these approaches generally lead to sub-par sample quality, limiting their use in real-world applications. Another line of work has focused on developing generative models that produce higher-quality samples but currently lack any formal privacy guarantees. In this work, we propose the first formal framework for membership privacy estimation in generative models. We formulate the membership privacy risk as a statistical divergence between training samples and hold-out samples, and propose sample-based methods to estimate this divergence. Compared to previous works, our framework makes more realistic and flexible assumptions. First, we offer a generalizable metric as an alternative to the accuracy metric, which is especially suited to imbalanced datasets. Second, we loosen the assumption of full access to the underlying distribution made in previous studies, and propose sample-based estimators with theoretical guarantees. Third, alongside the population-level membership privacy risk estimation via the optimal membership advantage, we offer individual-level estimation via the individual privacy risk. Fourth, our framework allows adversaries to access the trained model via a customized query, while prior works require access to specific attributes.
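As a rough illustration of the sample-based idea, the optimal membership advantage over a one-dimensional attack score (e.g., a per-sample model loss) can be estimated empirically as the largest gap between the empirical CDFs of training and hold-out scores. This is a minimal sketch, not the paper's estimator; the function name `membership_advantage` and the synthetic Gaussian scores are illustrative assumptions.

```python
import numpy as np

def membership_advantage(train_scores, holdout_scores):
    """Empirical estimate of the optimal membership advantage.

    Sweeps thresholds over all observed scores and returns the largest
    |TPR - FPR| gap, i.e., the best advantage of a threshold attacker
    that labels low-score points as training members.
    """
    thresholds = np.concatenate([train_scores, holdout_scores])
    best = 0.0
    for t in thresholds:
        tpr = np.mean(train_scores <= t)    # members flagged correctly
        fpr = np.mean(holdout_scores <= t)  # non-members flagged wrongly
        best = max(best, abs(tpr - fpr))
    return best

# Synthetic example: training losses slightly lower than hold-out losses,
# mimicking a model that mildly memorizes its training set.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 1000)
hold = rng.normal(0.5, 1.0, 1000)
adv = membership_advantage(train, hold)
```

Here the estimate converges, as the sample size grows, to the total variation distance between the two score distributions; an advantage near 0 indicates the scores carry little membership signal.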