Generative models have been found effective for data synthesis because they can capture complex underlying data distributions. The quality of the data they generate is commonly evaluated by visual inspection for image datasets or by downstream analytical tasks for tabular datasets. These evaluation methods neither measure the implicit data distribution nor consider data privacy issues, and how to compare and rank different generative models remains an open question. Because medical data can be sensitive, it is important to address patients' privacy concerns while maintaining the utility of the synthetic dataset. Beyond utility evaluation, this work outlines two metrics, Similarity and Uniqueness, for sample-wise assessment of synthetic datasets. We demonstrate the proposed metrics with several state-of-the-art generative models used to synthesise Cystic Fibrosis (CF) patients' electronic health records (EHRs), and observe that the metrics are suitable for synthetic data evaluation and generative model comparison.