Distance correlation has attracted considerable recent attention in the data science community: the sample statistic is straightforward to compute and converges to zero asymptotically if and only if the underlying random variables are independent, making it well suited to detecting any type of dependency structure given a sufficient sample size. One major bottleneck is the testing process: because the null distribution of distance correlation depends on the underlying random variables and the choice of metric, a permutation test is typically required to estimate the null and compute the p-value, which is very costly for large amounts of data. To overcome this difficulty, in this paper we propose a chi-square test for distance correlation. Method-wise, the chi-square test is non-parametric, extremely fast, and applicable to bias-corrected distance correlation using any metric of strong negative type or any characteristic kernel. The test exhibits testing power similar to that of the standard permutation test and can be used for K-sample and partial testing. Theory-wise, we show that the underlying chi-square distribution closely approximates and dominates the limiting null distribution in the upper tail, prove that the chi-square test can be valid and universally consistent for testing independence, and establish a testing power inequality with respect to the permutation test.
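The abstract does not state the test statistic or the exact reference distribution, so the following is only a minimal sketch of how such a test might look, under assumptions stated here. It assumes the bias-corrected sample distance correlation is computed by U-centering Euclidean distance matrices, and that the p-value compares n times that statistic against a centered chi-square distribution with one degree of freedom (that is, χ²₁ − 1). Both the centered chi-square form and the function names (`bias_corrected_dcor`, `chi2_dcor_pvalue`) are illustrative assumptions, not taken from the text above.

```python
import math
import numpy as np


def _pairwise_euclidean(x):
    """Return the n x n Euclidean distance matrix of the rows of x."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def _u_center(d):
    """U-center a distance matrix (unbiased double centering, zero diagonal)."""
    n = d.shape[0]
    row = d.sum(axis=1, keepdims=True) / (n - 2)
    col = d.sum(axis=0, keepdims=True) / (n - 2)
    total = d.sum() / ((n - 1) * (n - 2))
    u = d - row - col + total
    np.fill_diagonal(u, 0.0)
    return u


def bias_corrected_dcor(x, y):
    """Bias-corrected sample distance correlation (requires n >= 4)."""
    a = _u_center(_pairwise_euclidean(x))
    b = _u_center(_pairwise_euclidean(y))
    n = a.shape[0]
    if n < 4:
        raise ValueError("need at least 4 samples")
    scale = n * (n - 3)
    dcov_xy = (a * b).sum() / scale
    dvar_x = (a * a).sum() / scale
    dvar_y = (b * b).sum() / scale
    denom = math.sqrt(dvar_x * dvar_y)
    return dcov_xy / denom if denom > 0 else 0.0


def chi2_dcor_pvalue(x, y):
    """Return (statistic, p-value) without any permutations.

    ASSUMPTION: the null of n * statistic is approximated by chi2(1) - 1,
    so the p-value is P(chi2_1 > n * statistic + 1).
    """
    stat = bias_corrected_dcor(x, y)
    t = len(x) * stat + 1.0
    # For one degree of freedom, P(chi2_1 > t) = erfc(sqrt(t / 2)).
    p = math.erfc(math.sqrt(t / 2.0)) if t > 0 else 1.0
    return stat, p


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.uniform(-1.0, 1.0, size=200)
    print(chi2_dcor_pvalue(x, x ** 2))                  # nonlinear dependence
    print(chi2_dcor_pvalue(x, rng.normal(size=200)))    # independent draws
```

In this sketch the p-value comes from a single closed-form tail evaluation, in contrast to a permutation test, which would recompute the statistic once per permutation. The dense distance matrices still need O(n²) memory, so very large samples would call for a more careful implementation.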