Statistical testing is widespread and critical for a variety of scientific disciplines. The advent of machine learning and the growth of computing power have increased interest in the analysis and statistical testing of multidimensional data. We extend the powerful Kolmogorov-Smirnov two-sample test to a high-dimensional form in a manner similar to that of Fasano (Fasano, 1987). We call our result the d-dimensional Kolmogorov-Smirnov test (ddKS) and provide three novel contributions therewith: we develop an analytical equation for the significance of a given ddKS score; we provide an algorithm for computing ddKS on modern computing hardware that has constant time complexity for small sample sizes and dimensions; and we provide two approximate calculations of ddKS, one that reduces the time complexity to linear for larger sample sizes and another that reduces the time complexity to linear with increasing dimension. We perform power analysis of ddKS and its approximations on a corpus of datasets and compare them to other common high-dimensional two-sample tests and distances: Hotelling's T^2 test and the Kullback-Leibler divergence. Our ddKS test performs well for all datasets, dimensions, and sizes tested, whereas the other tests and distances fail to reject the null hypothesis on at least one dataset. We therefore conclude that ddKS is a powerful multidimensional two-sample test for general use that can be calculated quickly and efficiently using our parallel or approximate methods. Open-source implementations of all methods described in this work are available at https://github.com/pnnl/ddks.
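To make the construction concrete, the sketch below illustrates a brute-force, Fasano-style d-dimensional two-sample statistic in NumPy: each sample point splits the space into 2^d orthants, and the statistic is the largest difference between the two samples' empirical orthant fractions. This is an illustrative assumption of the general approach, not the authors' implementation or its significance calculation; the function name ddks_statistic and the Gaussian example data are hypothetical, and the optimized and approximate methods live in the ddks repository linked above.

import numpy as np

def ddks_statistic(x, y):
    """Brute-force sketch of a d-dimensional two-sample KS-style statistic:
    every sample point defines 2^d orthants, and the statistic is the largest
    difference between the two samples' empirical fractions in any orthant."""
    d = x.shape[1]
    weights = 1 << np.arange(d)  # per-dimension bits for the orthant index

    def max_orthant_gap(test_points):
        gap = 0.0
        for p in test_points:
            # Encode which side of p each sample point lies on in each dimension,
            # then collapse the bits into a single orthant index in [0, 2^d).
            codes_x = ((x >= p).astype(int) * weights).sum(axis=1)
            codes_y = ((y >= p).astype(int) * weights).sum(axis=1)
            frac_x = np.bincount(codes_x, minlength=2**d) / len(x)
            frac_y = np.bincount(codes_y, minlength=2**d) / len(y)
            gap = max(gap, np.abs(frac_x - frac_y).max())
        return gap

    # Center orthants on points from both samples and keep the larger maximum.
    return max(max_orthant_gap(x), max_orthant_gap(y))

# Example: two 3-dimensional Gaussian samples that differ only in mean.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(100, 3))
y = rng.normal(0.5, 1.0, size=(100, 3))
print(ddks_statistic(x, y))

This naive version costs O((n + m) * n * 2^d) per evaluation, which is why the constant-time parallel algorithm and the linear-time approximations described in the abstract matter in practice.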