Synthetic data generation has become a key ingredient for training machine learning models, supporting tasks such as data augmentation, the analysis of privacy-sensitive data, and the visualisation of representative samples. Assessing the quality of such synthetic data generators is therefore an important problem. As (deep) generative models for synthetic data often do not admit explicit probability distributions, classical statistical procedures for assessing model goodness-of-fit may not be applicable. In this paper, we propose a principled procedure for assessing the quality of a synthetic data generator. The procedure is a kernelised Stein discrepancy (KSD)-type test based on a non-parametric Stein operator for the synthetic data generator of interest. This operator is estimated from samples obtained from the synthetic data generator, and hence the test can be applied even when the model is only implicit. In contrast to classical testing, the sample drawn from the synthetic data generator can be as large as desired, while the size of the observed data, which the generator aims to emulate, is fixed. Experimental results on synthetic distributions, and on generative models trained on synthetic and real datasets, show that the method achieves improved power compared to existing approaches.
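For background, the following is a minimal sketch of the classical KSD construction that the proposed test builds on; the notation (model density $p$ with score $s_p(x) = \nabla_x \log p(x)$, data distribution $q$, positive definite kernel $k$) is standard KSD material and is not taken from the abstract itself. In the classical setting the score must be available in closed form, and the squared discrepancy is a double expectation of a Stein kernel under $q$:
$$
\mathrm{KSD}^2(p, q) \;=\; \mathbb{E}_{x, x' \sim q}\bigl[ h_p(x, x') \bigr],
$$
where
$$
h_p(x, x') \;=\; s_p(x)^{\top} k(x, x')\, s_p(x') \;+\; s_p(x)^{\top} \nabla_{x'} k(x, x') \;+\; \nabla_{x} k(x, x')^{\top} s_p(x') \;+\; \operatorname{tr}\bigl( \nabla_{x} \nabla_{x'} k(x, x') \bigr).
$$
As described in the abstract, the non-parametric Stein operator replaces the explicit score $s_p$ with an estimate computed from samples drawn from the generator, so that a KSD-type statistic remains computable even when $p$ is only implicit.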