This work proposes a method to evaluate the similarity between low-sample tabular data and synthetically generated data containing more samples than the original. Generating such a larger synthetic dataset is known as data augmentation. However, significance values derived from non-parametric tests are questionable when the sample size is limited. Our approach combines geometry, topology, and robust statistics for hypothesis testing to evaluate the "validity" of the generated data. We also contrast our findings with the prominent global metric practices described in the literature for large-sample data.