Organizations increasingly rely on data to support decisions. When data contains private and sensitive information, the data owner often wants to publish a synthetic database instance that is as useful as the true data while protecting the privacy of individual data records. Existing differentially private data synthesis methods aim to generate data that is useful for specific applications, but they fail to preserve one of the most fundamental properties of structured data: the underlying correlations and dependencies among tuples and attributes (i.e., the structure of the data). This structure is often expressed as integrity and schema constraints, or as a probabilistic generative process. As a result, the synthesized data is not useful for any downstream task that requires this structure to be preserved. This work presents Kamino, a data synthesis system that ensures differential privacy and preserves the structure and correlations present in the original dataset. Kamino takes as input a database instance along with its schema (including integrity constraints) and produces a synthetic database instance with differential privacy and structure preservation guarantees. We empirically show that, while preserving the structure of the data, Kamino achieves comparable or even better usefulness than state-of-the-art differentially private data synthesis methods in the applications of training classification models and answering marginal queries.
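To make the notion of "structure" concrete, the following minimal sketch checks one kind of integrity constraint, a functional dependency, on a toy table. It is not Kamino's algorithm; the helper `fd_violations`, the column names, and both example tables are hypothetical and serve only to show how a synthetic instance can match the true data on per-attribute values and still violate a constraint that the true data satisfies.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return the lhs keys for which the functional dependency lhs -> rhs fails.

    rows: list of dicts (one per tuple); lhs: list of attribute names;
    rhs: a single attribute name. Hypothetical helper for illustration.
    """
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        seen[key].add(row[rhs])
    # A key violates the dependency if it maps to more than one rhs value.
    return {key: vals for key, vals in seen.items() if len(vals) > 1}


# Toy "true" instance: zip -> city holds.
true_rows = [
    {"zip": "10001", "city": "New York", "age": 34},
    {"zip": "10001", "city": "New York", "age": 51},
    {"zip": "60601", "city": "Chicago", "age": 29},
]

# Toy synthetic instance: the same zip and city values appear,
# but the dependency zip -> city is broken.
synthetic_rows = [
    {"zip": "10001", "city": "New York", "age": 36},
    {"zip": "10001", "city": "Chicago", "age": 48},
    {"zip": "60601", "city": "Chicago", "age": 30},
]

true_result = fd_violations(true_rows, ["zip"], "city")
synthetic_result = fd_violations(synthetic_rows, ["zip"], "city")
```

Here `true_result` is empty, while `synthetic_result` flags the key `("10001",)`, which maps to two different cities. A downstream task that relies on the dependency, such as a join on `zip`, would behave differently on the synthetic instance.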