Over the last decade, an approach based on the notion of reproducing kernel Hilbert space (RKHS) embedding of probability distributions has become popular for tackling nonparametric testing problems on general (i.e., non-Euclidean) domains. The main goal of our work is to understand the optimality of two-sample tests constructed with this approach. First, we show that the popular maximum mean discrepancy (MMD) two-sample test is not optimal in terms of the separation boundary measured in Hellinger distance. Second, we propose a modification of the MMD test based on spectral regularization that takes into account covariance information (which the MMD test does not capture), and we prove that the proposed test is minimax optimal, with a smaller separation boundary than that achieved by the MMD test. Third, we propose an adaptive version of this test, which uses a data-driven strategy to choose the regularization parameter, and we show that the adaptive test is minimax optimal up to a logarithmic factor. Moreover, our results hold for the permutation variant of the test, in which the test threshold is chosen through permutation of the samples. Through numerical experiments on synthetic and real-world data, we demonstrate the superior performance of the proposed test in comparison to the MMD test.
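
For orientation, the baseline that the abstract compares against can be sketched in a few lines. The block below is a minimal illustration of a plain MMD two-sample test with a permutation threshold; it is not the spectral regularized test proposed in this work. The Gaussian kernel, the fixed `bandwidth`, the unbiased estimator of the squared MMD, and all function names are assumptions made for the example.

```python
import numpy as np


def gaussian_gram(Z, bandwidth):
    """Gram matrix of the Gaussian kernel on the rows of Z (assumed kernel choice)."""
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T)
    np.maximum(d2, 0.0, out=d2)  # clip tiny negatives caused by round-off
    return np.exp(-d2 / (2.0 * bandwidth ** 2))


def mmd2_unbiased(K, idx_x, idx_y):
    """Unbiased estimate of the squared MMD from a pooled Gram matrix K."""
    Kxx = K[np.ix_(idx_x, idx_x)]
    Kyy = K[np.ix_(idx_y, idx_y)]
    Kxy = K[np.ix_(idx_x, idx_y)]
    n, m = len(idx_x), len(idx_y)
    # Within-sample terms exclude the diagonal (i == j) entries.
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_x + term_y - 2.0 * Kxy.mean()


def mmd_permutation_test(X, Y, bandwidth=1.0, n_perm=200, seed=0):
    """Return (observed statistic, permutation p-value) for H0: P = Q."""
    rng = np.random.default_rng(seed)
    n, m = len(X), len(Y)
    K = gaussian_gram(np.vstack([X, Y]), bandwidth)
    idx = np.arange(n + m)
    observed = mmd2_unbiased(K, idx[:n], idx[n:])
    exceed = 0
    for _ in range(n_perm):
        # Relabel the pooled sample at random and recompute the statistic.
        perm = rng.permutation(n + m)
        if mmd2_unbiased(K, perm[:n], perm[n:]) >= observed:
            exceed += 1
    # The +1 correction keeps the p-value strictly positive.
    p_value = (exceed + 1) / (n_perm + 1)
    return observed, p_value
```

Rejecting the null hypothesis when the returned p-value falls below the chosen level gives the permutation version of the test. The Gram matrix is computed once on the pooled sample, so each permutation only re-indexes it.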