In this paper, we propose a method for measuring the similarity between low-sample tabular data and synthetically generated data containing more samples than the original. The process of generating such data is also known as data augmentation. However, significance levels obtained from non-parametric tests are suspect when the sample size is small. Our method combines geometry, topology, and robust statistics for hypothesis testing in order to assess the validity of the generated data. We also compare the results with common global metric methods available in the literature for large-sample data.