Generative models trained using Differential Privacy (DP) are increasingly used to produce and share synthetic data in a privacy-friendly manner. In this paper, we set out to analyze the impact of DP on these models vis-à-vis underrepresented classes and subgroups of data. We do so from two angles: 1) the size of classes and subgroups in the synthetic data, and 2) classification accuracy on them. We also evaluate the effect of various levels of imbalance and privacy budgets. Our experiments, conducted using three state-of-the-art DP models (PrivBayes, DP-WGAN, and PATE-GAN), show that DP results in opposite size distributions in the generated synthetic data. More precisely, it affects the gap between the majority and minority classes and subgroups, either reducing it (a "Robin Hood" effect) or widening it (a "Matthew" effect). However, both of these size shifts lead to similar disparate impacts on a classifier's accuracy, disproportionately affecting the underrepresented subparts of the data. As a result, we call for caution when analyzing or training a model on synthetic data, or risk treating different subpopulations unevenly, which might also lead to unreliable conclusions.
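The size-gap shift described above can be checked with a simple diagnostic: compare the spread between majority- and minority-class proportions in the real data against the synthetic data. A minimal sketch, with entirely hypothetical labels and a hand-rolled `size_gap` helper (not part of any of the evaluated models):

```python
from collections import Counter

def size_gap(labels):
    """Gap between the majority- and minority-class proportions."""
    counts = Counter(labels)
    total = sum(counts.values())
    props = [c / total for c in counts.values()]
    return max(props) - min(props)

# Hypothetical example: 90/10 imbalance in the real data,
# softened to 70/30 in the DP synthetic data.
real = ["maj"] * 90 + ["min"] * 10
synthetic = ["maj"] * 70 + ["min"] * 30

gap_real = size_gap(real)        # 0.8
gap_synth = size_gap(synthetic)  # 0.4

if gap_synth < gap_real:
    effect = "Robin Hood"  # gap between classes reduced
elif gap_synth > gap_real:
    effect = "Matthew"     # gap between classes widened
else:
    effect = "none"
```

In this toy example the gap shrinks from 0.8 to 0.4, i.e., a "Robin Hood" shift; a synthetic set of 95/5 would instead register a "Matthew" shift. Either way, as the abstract notes, the downstream accuracy impact on the minority class should be checked separately.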