We propose dpart, a general, flexible, and scalable framework and open-source Python library for differentially private synthetic data generation. Central to the approach is autoregressive modelling: the joint data distribution is broken into a sequence of lower-dimensional conditional distributions, each captured by a method such as a machine learning model (logistic/linear regression, decision trees, etc.), simple histogram counts, or a custom technique. The library is designed to serve as a quick and accessible baseline and to accommodate a wide audience of users, from those taking their first steps in synthetic data generation to more experienced users with domain expertise, who can configure different aspects of the modelling and contribute new methods/mechanisms. Specific instances of dpart include Independent; an optimized version of PrivBayes; and a newly proposed model, dp-synthpop. Code: https://github.com/hazy/dpart
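The autoregressive idea with simple histogram counts can be illustrated by a minimal sketch. This is not dpart's API: the function names `laplace`, `fit_noisy_conditionals`, and `sample_rows` are hypothetical, and the sketch makes several simplifying assumptions: all columns are categorical with publicly known domains, each column is conditioned only on the column immediately before it, neighbouring datasets differ by adding or removing one record (so each count table has L1 sensitivity 1), and the privacy budget is split evenly across columns by sequential composition. It is meant to show the structure of the approach, not a hardened implementation.

```python
import random


def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def fit_noisy_conditionals(rows, domains, epsilon, rng):
    """Fit one noisy count table per column.

    Table j maps a parent value (the value of column j-1, or None for the
    first column) to a list of non-negative weights over domains[j].
    """
    eps_per_col = epsilon / len(domains)  # assumed: even budget split
    tables = []
    for j, domain in enumerate(domains):
        parents = domains[j - 1] if j > 0 else [None]
        counts = {p: {v: 0.0 for v in domain} for p in parents}
        for row in rows:
            parent = row[j - 1] if j > 0 else None
            counts[parent][row[j]] += 1.0
        table = {}
        for p in parents:
            # Noise is added to every cell of the public domain, including
            # empty cells; negative noisy counts are clamped to zero.
            table[p] = [
                max(counts[p][v] + laplace(1.0 / eps_per_col, rng), 0.0)
                for v in domain
            ]
        tables.append(table)
    return tables


def sample_rows(tables, domains, n, rng):
    """Draw n synthetic rows column by column from the noisy tables."""
    synthetic = []
    for _ in range(n):
        row = []
        for j, domain in enumerate(domains):
            parent = row[j - 1] if j > 0 else None
            weights = tables[j][parent]
            if sum(weights) <= 0.0:
                # All noisy counts were clamped to zero: fall back to uniform.
                weights = [1.0] * len(domain)
            row.append(rng.choices(domain, weights=weights, k=1)[0])
        synthetic.append(tuple(row))
    return synthetic


# Illustrative usage on made-up toy data.
toy_domains = [["young", "old"], ["low", "mid", "high"]]
toy_rows = [("young", "low")] * 70 + [("old", "high")] * 30
toy_rng = random.Random(0)
toy_tables = fit_noisy_conditionals(toy_rows, toy_domains, epsilon=1.0, rng=toy_rng)
toy_synthetic = sample_rows(toy_tables, toy_domains, n=10, rng=toy_rng)
```

Only the noisy tables are used at sampling time, so the number of synthetic rows `n` is chosen by the caller rather than read from the private data. Replacing the histogram step with a differentially private regression or tree model for some columns would keep the same column-by-column structure.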