Assessing, visualizing and improving the utility of synthetic data

The synthpop package for R (https://www.synthpop.org.uk) provides tools that allow data custodians to create synthetic versions of confidential microdata that can be distributed with fewer restrictions than the original. The synthesis can be customized to ensure that relationships evident in the real data are reproduced in the synthetic data. A number of measures have been proposed to assess this aspect, commonly known as the utility of the synthetic data. We show that all these measures, including those calculated from tabulations, can be derived from a propensity score model. We review and compare the measures and illustrate the relations between them. All the measures compared are highly correlated, and some are shown to be identical. The method used to define the propensity score model is more important than the choice of measure. These measures and methods are incorporated into utility modules in the synthpop package. The modules include methods to visualize the results, giving immediate feedback that allows the person creating the synthetic data to improve its quality. The utility functions were originally designed for synthetic data objects of class \code{synds}, created by the \pkg{synthpop} functions \code{syn()} or \code{syn.strata()}, but they can now be used to compare one or more synthesised data sets with the original records, where the records are R data frames or lists of data frames.
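
As an illustration of how a tabulation-based measure can be read as a propensity score measure, the following minimal sketch computes the propensity score mean squared error (pMSE) for a saturated propensity model over the cells of a cross-tabulation. It is written in Python purely for illustration and is not the synthpop implementation; the function name `pmse_from_tables` and the toy records are assumptions introduced here. The sketch assumes the usual definition pMSE = (1/N) Σ (p_i − c)², where p_i is the fitted probability that record i is synthetic and c is the overall share of synthetic records. For a saturated model on categorical variables, p_i is simply the share of synthetic records in the record's table cell.

```python
from collections import Counter


def pmse_from_tables(original, synthetic):
    """pMSE for a saturated propensity model over cross-tabulation cells.

    Each record is a hashable tuple of categorical values, i.e. one table cell.
    Hypothetical helper for illustration only, not part of synthpop.
    """
    n_orig, n_syn = len(original), len(synthetic)
    if n_orig == 0 or n_syn == 0:
        raise ValueError("both data sets must contain at least one record")
    total = n_orig + n_syn
    c = n_syn / total  # overall share of synthetic records

    orig_counts = Counter(original)
    syn_counts = Counter(synthetic)

    pmse = 0.0
    for cell in set(orig_counts) | set(syn_counts):
        cell_total = orig_counts[cell] + syn_counts[cell]
        # Fitted propensity of a saturated model: share of synthetic records in the cell.
        propensity = syn_counts[cell] / cell_total
        # Every record in the cell contributes the same squared deviation.
        pmse += cell_total * (propensity - c) ** 2
    return pmse / total


# Toy records (sex, age group); the values are made up for this example.
observed = [("F", "young")] * 30 + [("F", "old")] * 20 + [("M", "young")] * 25 + [("M", "old")] * 25
same_table = list(observed)  # synthetic data with an identical cross-tabulation
shifted = [("F", "young")] * 45 + [("F", "old")] * 5 + [("M", "young")] * 25 + [("M", "old")] * 25

pmse_same = pmse_from_tables(observed, same_table)
pmse_shifted = pmse_from_tables(observed, shifted)
```

When the synthetic table matches the original exactly, every cell has propensity c and the pMSE is zero; the more the cell counts diverge, the larger the value, up to c(1 − c) when the two data sets share no cells.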
