When a database is protected by Differential Privacy (DP), its usability is limited in scope. In this scenario, generating a synthetic version of the data that mimics the properties of the private data allows users to perform any operation on the synthetic data, while maintaining the privacy of the original data. Therefore, multiple works have been devoted to devising systems for DP synthetic data generation. However, such systems may preserve or even magnify properties of the data that make it unfair, rendering the synthetic data unfit for use. In this work, we present PreFair, a system that allows for DP fair synthetic data generation. PreFair extends state-of-the-art DP data generation mechanisms by incorporating a causal fairness criterion that ensures fair synthetic data. We adapt the notion of justifiable fairness to fit the synthetic data generation scenario. We further study the problem of generating DP fair synthetic data, showing its intractability and designing algorithms that are optimal under certain assumptions. We also provide an extensive experimental evaluation, showing that PreFair generates synthetic data that is significantly fairer than the data generated by leading DP data generation mechanisms, while remaining faithful to the private data.