Increasing interest in privacy-preserving machine learning has led to new models for synthetic private data generation from undisclosed real data. However, mechanisms of privacy preservation introduce artifacts in the resulting synthetic data that have a significant impact on downstream tasks such as learning predictive models or inference. In particular, bias can affect all analyses, as the synthetic data distribution is an inconsistent estimate of the real-data distribution. We propose several bias mitigation strategies using privatized likelihood ratios that apply generally to differentially private synthetic data generative models. Through large-scale empirical evaluation, we show that bias mitigation provides simple and effective privacy-compliant augmentation for general applications of synthetic data. However, the work also highlights that, even after bias correction, significant challenges remain regarding the usefulness of synthetic private data generators for tasks such as prediction and inference.
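To make the idea of likelihood-ratio bias mitigation concrete, the following is a minimal illustrative sketch, not the paper's exact method: it uses the standard classifier-based density-ratio trick (a probabilistic classifier distinguishing real from synthetic samples yields importance weights approximating p_real(x)/p_synth(x)), and then reweights estimates computed on synthetic data. The function names `estimate_importance_weights` and `weighted_mean` are hypothetical, and the sketch deliberately omits the privatization of the ratio estimate itself (e.g. training the classifier with a differentially private procedure), which the real setting would require since the real data is undisclosed.

```python
# Illustrative sketch only: classifier-based importance weighting to mitigate the
# bias of estimates computed from synthetic data.  Privatization of the weight
# estimator is omitted here; in a DP pipeline the classifier would itself have to
# be trained privately (e.g. with DP-SGD).
import numpy as np
from sklearn.linear_model import LogisticRegression


def estimate_importance_weights(real_X, synth_X):
    """Estimate w(x) ~ p_real(x) / p_synth(x) with a probabilistic classifier."""
    X = np.vstack([real_X, synth_X])
    y = np.concatenate([np.ones(len(real_X)), np.zeros(len(synth_X))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    p = clf.predict_proba(synth_X)[:, 1]        # P(real | x) on synthetic points
    ratio = p / (1.0 - p)                       # odds ~ p_real(x) / p_synth(x), up to class prior
    ratio *= len(synth_X) / len(real_X)         # correct for class-size imbalance
    return ratio / ratio.mean()                 # self-normalise the weights


def weighted_mean(values, weights):
    """Bias-mitigated estimate of a real-data mean from weighted synthetic samples."""
    return np.average(values, weights=weights)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(0.0, 1.0, size=(2000, 1))    # stand-in for undisclosed real data
    synth = rng.normal(0.5, 1.2, size=(2000, 1))   # deliberately biased "synthetic" sample
    w = estimate_importance_weights(real, synth)
    print("naive synthetic mean:   ", synth.mean())                  # biased towards 0.5
    print("importance-weighted mean:", weighted_mean(synth[:, 0], w))  # closer to the true 0.0
```

The same weights can be reused for any downstream weighted estimator (weighted losses when fitting predictive models, weighted sufficient statistics for inference), which is what makes this style of correction broadly applicable across synthetic data generators.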