Several measures have been proposed to assess the utility of synthetic data. Some are based on distances between the original and synthetic distributions; others combine the original and synthetic data and use a propensity score to predict the origin of each record. We review and compare these methods and illustrate the relations between them. The measures are implemented in utility modules of the \pkg{synthpop} package, which also provide methods to visualize the results. We illustrate how to compare different syntheses and how to diagnose which aspects of the synthetic data differ from the original. The utility functions were originally designed for synthetic data objects of class synds, created by synthpop, but they can now also be used to compare synthetic data created by other methods with the original records. The utility measures can be standardized by their expected null distributions under a correct synthesis model. If they are used to evaluate other types of altered data, not generated from a model, then this standardization can be interpreted as the ratio of the difference from the original to the expected stochastic error.
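The propensity-score idea and its standardization can be sketched in a few lines. The example below is an illustration in Python with NumPy only, not the \pkg{synthpop} implementation (which is an R package); the function names `propensity_scores` and `pmse_ratio` and the simulated data are hypothetical. It assumes a main-effects logistic propensity model and uses the large-sample null expectation of the propensity mean squared error (pMSE) commonly quoted for that model, $(k-1)(1-c)^2 c / N$, where $k$ is the number of model parameters including the intercept, $c$ is the proportion of synthetic records, and $N$ is the combined sample size.

```python
import numpy as np

def propensity_scores(X, y, n_iter=25, ridge=1e-8):
    """Fit a logistic regression by Newton-Raphson and return fitted probabilities."""
    Xd = np.column_stack([np.ones(len(X)), X])   # design matrix with intercept
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xd @ beta))
        w = p * (1.0 - p)
        grad = Xd.T @ (y - p)
        # Tiny ridge term only guards against a singular Hessian.
        hess = Xd.T @ (Xd * w[:, None]) + ridge * np.eye(Xd.shape[1])
        beta += np.linalg.solve(hess, grad)
    return 1.0 / (1.0 + np.exp(-Xd @ beta))

def pmse_ratio(original, synthetic):
    """Return (pMSE, pMSE divided by its assumed null expectation)."""
    X = np.vstack([original, synthetic])
    # Origin indicator: 0 = original record, 1 = synthetic record.
    y = np.concatenate([np.zeros(len(original)), np.ones(len(synthetic))])
    N = len(y)
    c = len(synthetic) / N
    p = propensity_scores(X, y)
    pmse = np.mean((p - c) ** 2)
    k = X.shape[1] + 1                           # parameters incl. intercept
    expected_null = (k - 1) * (1 - c) ** 2 * c / N
    return pmse, pmse / expected_null

# Simulated data (an assumption for illustration only).
rng = np.random.default_rng(0)
original = rng.normal(size=(500, 2))
good_syn = rng.normal(size=(500, 2))             # drawn from the same distribution
bad_syn = rng.normal(loc=0.5, size=(500, 2))     # shifted means: a poor synthesis

pmse_good, ratio_good = pmse_ratio(original, good_syn)
pmse_bad, ratio_bad = pmse_ratio(original, bad_syn)
print(f"good synthesis: pMSE={pmse_good:.5f}, ratio={ratio_good:.2f}")
print(f"bad synthesis:  pMSE={pmse_bad:.5f}, ratio={ratio_bad:.2f}")
```

Under these assumptions, a ratio near 1 is what a correct synthesis model would be expected to give, while a much larger ratio indicates that the propensity model can tell synthetic from original records far better than chance.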