Reproducing Kernel Hilbert Space (RKHS) embedding of probability distributions has proved to be an effective approach, via MMD (maximum mean discrepancy), for nonparametric hypothesis testing problems involving distributions defined over general (non-Euclidean) domains. While a substantial amount of work has been done on this topic, only recently have minimax optimal two-sample tests been constructed that incorporate, unlike MMD, both the mean element and a regularized version of the covariance operator. However, as with most kernel algorithms, the optimal test scales cubically in the sample size, limiting its applicability. In this paper, we propose a spectral-regularized two-sample test based on random Fourier feature (RFF) approximation and investigate the trade-offs between statistical optimality and computational efficiency. We show the proposed test to be minimax optimal if the approximation order of RFF (which depends on the smoothness of the likelihood ratio and the decay rate of the eigenvalues of the integral operator) is sufficiently large. We develop a practically implementable permutation-based version of the proposed test with a data-adaptive strategy for selecting the regularization parameter. Finally, through numerical experiments on simulated and benchmark datasets, we demonstrate that the proposed RFF-based test is computationally efficient and performs almost similarly (with a small drop in power) to the exact test.
翻译:再生核希尔伯特空间(RKHS)通过最大均值差异(MMD)对概率分布进行嵌入,已被证明是处理定义在一般(非欧几里得)域上的分布所涉及的非参数假设检验问题的有效方法。尽管已有大量相关工作,但直到最近才构建出最小化最大最优的双样本检验,与MMD不同,该检验同时结合了均值元和协方差算子的正则化版本。然而,与大多数核算法一样,最优检验的计算复杂度随样本量呈立方级增长,限制了其适用性。本文提出一种基于随机傅里叶特征(RFF)近似的谱正则化双样本检验,并探讨统计最优性与计算效率之间的权衡。我们证明,若RFF的近似阶数(其取决于似然比的平滑度及积分算子特征值的衰减率)足够大,所提出的检验是最小化最大最优的。我们开发了一种基于置换的、可实际实施的检验版本,并采用数据自适应策略选择正则化参数。最后,通过在模拟和基准数据集上的数值实验,我们证明所提出的基于RFF的检验计算高效,且性能(在功效上有小幅下降)与精确检验几乎相当。