Generative models trained with Differential Privacy (DP) can be used to generate synthetic data while minimizing privacy risks. We analyze the impact of DP on these models vis-à-vis underrepresented classes/subgroups of data, specifically studying: 1) the size of classes/subgroups in the synthetic data, and 2) the accuracy of classification tasks run on them. We also evaluate the effect of various levels of imbalance and privacy budgets. Our analysis uses three state-of-the-art DP models (PrivBayes, DP-WGAN, and PATE-GAN) and shows that DP skews the size distributions in the generated synthetic data in opposite directions: it affects the gap between the majority and minority classes/subgroups, in some cases reducing it (a "Robin Hood" effect) and, in others, increasing it (a "Matthew" effect). Either way, this leads to (similar) disparate impacts on the accuracy of classification tasks on the synthetic data, disproportionately affecting the underrepresented subparts of the data. Consequently, when training models on synthetic data, one risks treating different subpopulations unevenly, leading to unreliable or unfair conclusions.
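To make the two measurements concrete, below is a minimal Python sketch (not the paper's code) of how one might quantify both effects. The `synthesize` callable is a hypothetical stand-in for any DP generative model (e.g., PrivBayes, DP-WGAN, or PATE-GAN); everything else uses standard numpy/scikit-learn.

```python
# A minimal sketch (not the paper's code) of the two measurements above:
# (1) class/subgroup sizes in the synthetic data and (2) per-class accuracy
# of a downstream classifier trained on it. `synthesize` is a hypothetical
# stand-in for any DP generative model (PrivBayes, DP-WGAN, PATE-GAN, ...).
import numpy as np
from collections import Counter
from sklearn.linear_model import LogisticRegression

def class_proportions(y):
    """Fraction of records belonging to each class."""
    counts = Counter(y)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def per_class_accuracy(model, X_test, y_test):
    """Accuracy computed separately for each class in the test set."""
    y_test = np.asarray(y_test)
    preds = model.predict(X_test)
    return {c: float(np.mean(preds[y_test == c] == c)) for c in np.unique(y_test)}

def evaluate(synthesize, X_real, y_real, X_test, y_test, epsilon):
    # 1) Size effect: how do class proportions change after DP synthesis?
    X_syn, y_syn = synthesize(X_real, y_real, epsilon)  # hypothetical API
    real_prop = class_proportions(y_real)
    syn_prop = class_proportions(y_syn)
    size_gap = {c: syn_prop.get(c, 0.0) - p for c, p in real_prop.items()}

    # 2) Accuracy effect: train the same classifier on real vs. synthetic
    #    data and compare per-class accuracy on a shared held-out test set.
    clf_real = LogisticRegression(max_iter=1000).fit(X_real, y_real)
    clf_syn = LogisticRegression(max_iter=1000).fit(X_syn, y_syn)
    return (size_gap,
            per_class_accuracy(clf_real, X_test, y_test),
            per_class_accuracy(clf_syn, X_test, y_test))
```

Under this sketch, the sign of `size_gap` for a minority class separates the two effects: a positive value means the DP model inflated the minority's share (narrowing the gap, "Robin Hood"), while a negative value means it shrank further ("Matthew").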