Differentially private (DP) synthetic datasets are a powerful approach to training machine learning models while respecting the privacy of individual data providers. The effect of DP on the fairness of the resulting trained models, however, is not yet well understood. In this contribution, we systematically study the effects of differentially private synthetic data generation on classification. We analyze disparities in model utility and bias caused by the synthetic dataset, measured through algorithmic fairness metrics. Our first set of results shows that, although we observe a clear negative correlation between privacy and utility (the more private, the less accurate) across all data synthesizers we evaluated, more privacy does not necessarily imply more bias. Additionally, we assess the effects of utilizing synthetic datasets for both model training and model evaluation. We show that results obtained on synthetic data can misestimate the actual model performance when the model is deployed on real data. We hence advocate the need for defining proper testing protocols in scenarios where differentially private synthetic datasets are utilized for model training and evaluation.
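To make the testing-protocol concern concrete, the following is a minimal, self-contained sketch of the train-on-synthetic / test-on-real setup described above: a classifier is trained on a stand-in "synthetic" dataset, and both accuracy and a demographic parity gap are compared across synthetic and real test data. The noisy copy standing in for DP synthetic data and the protected attribute derived from a feature are illustrative assumptions only, not the paper's actual synthesizers, datasets, or metrics.

```python
# Minimal sketch of training on synthetic data and evaluating on real data.
# The "synthetic" set here is just a noisy copy of a toy dataset, standing in
# for the output of a DP synthesizer; the noise is NOT a calibrated DP
# mechanism, only an illustration of the evaluation protocol.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy "real" data; the protected attribute s is derived from one feature
# purely for illustration.
X, y = make_classification(n_samples=4000, n_features=10, random_state=0)
s = (X[:, 0] > 0).astype(int)
X_train, X_real, y_train, y_real, s_train, s_real = train_test_split(
    X, y, s, test_size=0.5, random_state=0)

# Stand-in "DP synthetic" training set: a perturbed copy of the real training
# split (an actual study would use a DP synthesizer at some budget epsilon).
X_syn = X_train + rng.normal(scale=1.0, size=X_train.shape)
y_syn, s_syn = y_train, s_train

def demographic_parity_difference(y_pred, protected):
    """Absolute gap in positive-prediction rates between the two groups."""
    return abs(y_pred[protected == 0].mean() - y_pred[protected == 1].mean())

model = LogisticRegression(max_iter=1000).fit(X_syn, y_syn)

# Evaluating on synthetic data can misestimate performance on real data:
print("accuracy on synthetic data:", accuracy_score(y_syn, model.predict(X_syn)))
print("accuracy on real data:    ", accuracy_score(y_real, model.predict(X_real)))
print("demographic parity gap on real data:",
      demographic_parity_difference(model.predict(X_real), s_real))
```

Comparing the two accuracy figures illustrates why evaluation on synthetic data alone can be misleading, and why a holdout of real data is needed in any testing protocol of this kind.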