Kernel two-sample tests are widely used, and efficient methods for high-dimensional, large-scale data are attracting growing attention in the big data era. However, existing methods, such as the maximum mean discrepancy (MMD) test and recently proposed kernel-based tests for large-scale data, are computationally intensive to implement and/or ineffective against some common alternatives in high-dimensional data. In this paper, we propose a new test that exhibits high power for a wide range of alternatives. Moreover, the new test is more robust to high dimensionality than existing methods and does not require optimization procedures, based on data splitting, for choosing the kernel bandwidth and other parameters. Numerical studies show that the new approach performs well on both synthetic and real-world data.
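For readers unfamiliar with the baseline mentioned above, the following is a minimal sketch of a standard MMD two-sample test, not the new test proposed in this paper. It assumes a Gaussian kernel, a median-heuristic bandwidth, the unbiased quadratic-time estimate of the squared MMD, and a permutation null; none of these choices are specified in the abstract, and the function names are hypothetical. The kernel matrix alone costs a quadratic number of kernel evaluations in the pooled sample size, which illustrates the computational burden referred to above.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def median_bandwidth(points):
    """Median heuristic (an assumed choice): a median pairwise distance."""
    dists = sorted(
        math.dist(points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
    )
    # Fall back to 1.0 if the median distance is zero (degenerate sample).
    return dists[len(dists) // 2] or 1.0


def kernel_matrix(points, bandwidth):
    """Full symmetric kernel matrix over the pooled sample."""
    n = len(points)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            K[i][j] = K[j][i] = gaussian_kernel(points[i], points[j], bandwidth)
    return K


def mmd2_unbiased(K, idx_x, idx_y):
    """Unbiased estimate of squared MMD from a precomputed kernel matrix."""
    m, n = len(idx_x), len(idx_y)
    xx = sum(K[i][j] for i in idx_x for j in idx_x if i != j) / (m * (m - 1))
    yy = sum(K[i][j] for i in idx_y for j in idx_y if i != j) / (n * (n - 1))
    xy = sum(K[i][j] for i in idx_x for j in idx_y) / (m * n)
    return xx + yy - 2.0 * xy


def mmd_permutation_test(X, Y, num_permutations=200, seed=0):
    """Return (observed squared-MMD estimate, permutation p-value)."""
    rng = random.Random(seed)
    pooled = list(X) + list(Y)
    m = len(X)
    K = kernel_matrix(pooled, median_bandwidth(pooled))
    idx = list(range(len(pooled)))
    observed = mmd2_unbiased(K, idx[:m], idx[m:])
    exceed = 0
    for _ in range(num_permutations):
        rng.shuffle(idx)  # relabel the pooled sample under the null
        if mmd2_unbiased(K, idx[:m], idx[m:]) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (exceed + 1) / (num_permutations + 1)
```

A call such as `mmd_permutation_test(X, Y)` takes two lists of equal-dimension vectors; a small p-value suggests the two samples come from different distributions. Note that the bandwidth here is fixed by a heuristic rather than tuned, which is one of the parameter choices the abstract refers to.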