Differentially private (DP) synthetic datasets are a powerful approach to training machine learning models while respecting the privacy of individual data providers. The effect of DP on the fairness of the resulting trained models, however, is not yet well understood. In this contribution, we systematically study the effects of differentially private synthetic data generation on classification. We analyze disparities in model utility and bias caused by the synthetic dataset, measured through algorithmic fairness metrics. Our first set of results shows that, although we observe a clear negative correlation between privacy and utility (the more private, the less accurate) across all data synthesizers we evaluated, more privacy does not necessarily imply more bias. Additionally, we assess the effects of utilizing synthetic datasets for both model training and model evaluation. We show that results obtained on synthetic data can misestimate the actual model performance when the model is deployed on real data. We hence advocate the need for defining proper testing protocols in scenarios where differentially private synthetic datasets are utilized for model training and evaluation.
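To make the testing-protocol concern concrete, the following is a minimal, self-contained sketch of the train-on-synthetic / test-on-real setup described above: a classifier is trained on a stand-in "synthetic" dataset, and both accuracy and a demographic parity gap are compared across synthetic and real test data. The noisy copy standing in for DP synthetic data and the protected attribute derived from a feature are illustrative assumptions only, not the paper's actual synthesizers, datasets, or metrics.

```python
# Minimal sketch of training on synthetic data and evaluating on real data.
# The "synthetic" set here is just a noisy copy of a toy dataset, standing in
# for the output of a DP synthesizer; the noise is NOT a calibrated DP
# mechanism, only an illustration of the evaluation protocol.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy "real" data; the protected attribute s is derived from one feature
# purely for illustration.
X, y = make_classification(n_samples=4000, n_features=10, random_state=0)
s = (X[:, 0] > 0).astype(int)
X_train, X_real, y_train, y_real, s_train, s_real = train_test_split(
    X, y, s, test_size=0.5, random_state=0)

# Stand-in "DP synthetic" training set: a perturbed copy of the real training
# split (an actual study would use a DP synthesizer at some budget epsilon).
X_syn = X_train + rng.normal(scale=1.0, size=X_train.shape)
y_syn, s_syn = y_train, s_train

def demographic_parity_difference(y_pred, protected):
    """Absolute gap in positive-prediction rates between the two groups."""
    return abs(y_pred[protected == 0].mean() - y_pred[protected == 1].mean())

model = LogisticRegression(max_iter=1000).fit(X_syn, y_syn)

# Evaluating on synthetic data can misestimate performance on real data:
print("accuracy on synthetic data:", accuracy_score(y_syn, model.predict(X_syn)))
print("accuracy on real data:    ", accuracy_score(y_real, model.predict(X_real)))
print("demographic parity gap on real data:",
      demographic_parity_difference(model.predict(X_real), s_real))
```

Comparing the two accuracy figures illustrates why evaluation on synthetic data alone can be misleading, and why a holdout of real data is needed in any testing protocol of this kind.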