A common approach to synthetic data is to sample from a fitted model. We show that, under general assumptions, this approach results in a sample with inefficient estimators and whose joint distribution is inconsistent with the true distribution. Motivated by this, we propose a general method of producing synthetic data, which is widely applicable for parametric models, has asymptotically efficient summary statistics, and is both easily implemented and highly computationally efficient. Our approach allows for the construction of both partially synthetic datasets, which preserve certain summary statistics, and fully synthetic data, which satisfy the strong guarantee of differential privacy (DP); the same asymptotic guarantees hold in both cases. We also provide theoretical and empirical evidence that the distribution from our procedure converges to the true distribution. Beyond synthetic data, our procedure can also be used to perform approximate hypothesis tests in the presence of intractable likelihood functions.