In this paper, we present findings on several methodologies for measuring how similar synthetic data are to the real tabular data samples from which they were generated. We focus on the case where the synthetic data set contains many more samples than the real one, which raises a particular difficulty: validating the reliability of synthetic data whose sample count far exceeds that of the original. We evaluate the global metrics most commonly used in the literature and introduce a novel approach based on analysis of the data's topological signature. Topological data analysis has several advantages for this problem. It studies qualitative geometric information, focusing on geometric properties rather than on the quantitative values of a distance function. This is especially useful for high-dimensional synthetic data whose sample size has been significantly increased, which is comparable to introducing new data points into the data space within the limits set by the original data. In such large synthetic data spaces, points are much more concentrated than in the original space, so their analysis becomes much more sensitive both to the metrics used and to noise. Qualitative geometric information relies instead on the notion of "closeness" between points. Finally, we propose an approach based on the eigenvectors of the data for evaluating the level of noise in synthetic data; this approach can also be used to assess the similarity of original and synthetic data.
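To make the eigenvector idea concrete, the following is a minimal sketch and not the procedure used in the paper, whose details the abstract does not give. It assumes that "data eigenvectors" can be read as the eigenvectors of the sample covariance matrix, and it compares the leading eigenvectors of a real and a synthetic sample by absolute cosine similarity. The function name `eigen_alignment`, the parameter `k`, and the toy Gaussian data are all hypothetical choices made for illustration.

```python
import numpy as np


def eigen_alignment(real, synth, k=2):
    """Absolute cosine between the top-k covariance eigenvectors of two samples.

    Assumption (not from the paper): "data eigenvectors" means the
    eigenvectors of the sample covariance matrix. Values near 1 mean the
    principal directions agree; lower values suggest distortion or noise.
    """
    def top_vectors(x):
        cov = np.cov(x, rowvar=False)          # columns are features
        vals, vecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
        order = np.argsort(vals)[::-1][:k]     # indices of the k largest
        return vecs[:, order]                  # unit-norm eigenvectors as columns

    a = top_vectors(np.asarray(real, dtype=float))
    b = top_vectors(np.asarray(synth, dtype=float))
    # Eigenvectors have unit norm, so the column-wise dot product is the cosine.
    # The sign of an eigenvector is arbitrary, hence the absolute value.
    return np.abs(np.sum(a * b, axis=0))


# Toy illustration: a small "real" sample and a ten times larger "synthetic"
# sample drawn from the same anisotropic Gaussian (hypothetical data).
rng = np.random.default_rng(0)
scales = np.array([3.0, 1.0, 0.5])
real = rng.normal(size=(200, 3)) * scales
synth = rng.normal(size=(2000, 3)) * scales

scores = eigen_alignment(real, synth, k=2)
print(scores)
```

Because the score depends only on directions of variance and not on the number of rows, it can be computed when the synthetic set is much larger than the real one. It is only one possible reading of the approach, and it becomes unstable when two eigenvalues are nearly equal, since the corresponding eigenvectors are then not well defined.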