A common approach to synthetic data is to sample from a fitted model. We show that, under general assumptions, this approach yields a sample whose estimators are inefficient and whose joint distribution is inconsistent with the true distribution. Motivated by this, we propose a general method of producing synthetic data that is widely applicable to parametric models, has asymptotically efficient summary statistics, and is both easily implemented and highly computationally efficient. Our approach allows for the construction of both partially synthetic datasets, which preserve certain summary statistics, and fully synthetic datasets, which satisfy the strong guarantee of differential privacy (DP), with the same asymptotic guarantees in each case. We also provide theoretical and empirical evidence that the distribution of the data produced by our procedure converges to the true distribution. Beyond synthetic data, our procedure can also be used to perform approximate hypothesis tests in the presence of intractable likelihood functions.
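The inefficiency of sampling from a fitted model can be illustrated with a small simulation. The sketch below is not the procedure proposed here; it only demonstrates the baseline approach under an assumed Gaussian model, with hypothetical names (`compare_mean_variance`, `n`, `reps`) and arbitrary parameter values. It fits a normal distribution to each original dataset, draws a synthetic dataset of the same size from the fitted model, and compares the sampling variance of the mean computed on the original data with that of the mean computed on the synthetic data.

```python
import numpy as np

def compare_mean_variance(n=200, reps=4000, mu=1.0, sigma=2.0, seed=0):
    """Compare the variance of the sample mean on original vs. synthetic data.

    Assumed setup (illustrative only): data are i.i.d. N(mu, sigma^2), and the
    synthetic data are drawn from the fitted model N(mu_hat, sigma_hat^2).
    """
    rng = np.random.default_rng(seed)

    # reps independent original datasets, each of size n
    x = rng.normal(mu, sigma, size=(reps, n))
    mu_hat = x.mean(axis=1)            # fitted mean per dataset
    sigma_hat = x.std(axis=1, ddof=1)  # fitted standard deviation per dataset

    # One synthetic dataset of size n per original dataset, sampled from the fit
    z = rng.normal(mu_hat[:, None], sigma_hat[:, None], size=(reps, n))
    mu_syn = z.mean(axis=1)            # mean re-estimated from synthetic data

    # Monte Carlo estimates of the two sampling variances
    return mu_hat.var(ddof=1), mu_syn.var(ddof=1)

var_orig, var_syn = compare_mean_variance()
ratio = var_syn / var_orig
print(f"variance (original):  {var_orig:.5f}")
print(f"variance (synthetic): {var_syn:.5f}")
print(f"ratio:                {ratio:.2f}")
```

In this toy setting the synthetic-data mean carries both the estimation error of the fit and fresh sampling noise, so its variance is roughly twice that of the original-data mean, which is one concrete sense in which the estimator is inefficient.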