We present a study of a kernel-based two-sample test statistic, related to the Maximum Mean Discrepancy (MMD), in the manifold data setting, where high-dimensional observations are assumed to lie close to a low-dimensional manifold. We characterize the test level and power in relation to the kernel bandwidth, the number of samples, and the intrinsic dimensionality of the manifold. Specifically, we show that when data densities are supported on a $d$-dimensional sub-manifold $\mathcal{M}$ embedded in an $m$-dimensional space, the kernel two-sample test for data sampled from a pair of distributions $(p, q)$ that are H\"older with order $\beta$ is consistent and powerful when the number of samples $n$ is greater than $\delta_2(p,q)^{-2-d/\beta}$ up to a certain constant, where $\delta_2$ is the squared $\ell_2$-divergence between the two distributions on the manifold. Moreover, to achieve testing consistency under this scaling of $n$, our theory suggests that the kernel bandwidth $\gamma$ scales as $n^{-1/(d+2\beta)}$. These results indicate that the kernel two-sample test does not suffer from the curse of dimensionality when the data lie on a low-dimensional manifold. We demonstrate the validity of our theory and the properties of the kernel test for manifold data using several numerical experiments.
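To make the setting concrete, the following is a minimal sketch of a kernel two-sample test on manifold data, not the exact statistic or calibration analyzed in this work. It assumes a Gaussian kernel, a biased (V-statistic) MMD estimate, a permutation test for the threshold, and a toy manifold: the unit circle ($d = 1$) embedded in $\mathbb{R}^m$ with $m = 20$. The function names (`bandwidth`, `gaussian_gram`, `kernel_two_sample_test`, `sample_circle`) and the choice $\beta = 1$ are illustrative assumptions; only the bandwidth scaling $\gamma \propto n^{-1/(d+2\beta)}$ comes from the text, and the constant in front of it is set to 1 for simplicity.

```python
import numpy as np


def bandwidth(n, d, beta):
    """Bandwidth scaling n^{-1/(d + 2*beta)}; the leading constant 1 is an assumption."""
    return n ** (-1.0 / (d + 2.0 * beta))


def gaussian_gram(Z, gamma):
    """Gaussian kernel Gram matrix exp(-||z_i - z_j||^2 / (2 * gamma^2))."""
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T)
    d2 = np.maximum(d2, 0.0)  # clip tiny negatives caused by round-off
    return np.exp(-d2 / (2.0 * gamma ** 2))


def mmd2_from_gram(K, idx_x, idx_y):
    """Biased (V-statistic) squared-MMD estimate from a precomputed Gram matrix."""
    kxx = K[np.ix_(idx_x, idx_x)].mean()
    kyy = K[np.ix_(idx_y, idx_y)].mean()
    kxy = K[np.ix_(idx_x, idx_y)].mean()
    return kxx + kyy - 2.0 * kxy


def kernel_two_sample_test(X, Y, gamma, n_perm=200, seed=0):
    """Return the statistic and a permutation p-value (permutation calibration is an assumption)."""
    rng = np.random.default_rng(seed)
    n_x = len(X)
    Z = np.vstack([X, Y])
    K = gaussian_gram(Z, gamma)
    idx = np.arange(len(Z))
    stat = mmd2_from_gram(K, idx[:n_x], idx[n_x:])
    null = np.empty(n_perm)
    for b in range(n_perm):
        perm = rng.permutation(len(Z))  # reshuffle the pooled sample labels
        null[b] = mmd2_from_gram(K, perm[:n_x], perm[n_x:])
    p_value = (1.0 + np.sum(null >= stat)) / (1.0 + n_perm)
    return stat, p_value


def sample_circle(theta, m):
    """Embed angles on the unit circle into the first two coordinates of R^m."""
    X = np.zeros((len(theta), m))
    X[:, 0] = np.cos(theta)
    X[:, 1] = np.sin(theta)
    return X


# Toy example: p is uniform on the circle, q is von Mises with concentration 1.
rng = np.random.default_rng(1)
n, m, d, beta = 200, 20, 1, 1.0
X = sample_circle(rng.uniform(-np.pi, np.pi, size=n), m)
Y = sample_circle(rng.vonmises(0.0, 1.0, size=n), m)
gamma = bandwidth(n, d, beta)
stat, p_value = kernel_two_sample_test(X, Y, gamma)
print(f"gamma={gamma:.3f}  MMD^2={stat:.4f}  p-value={p_value:.4f}")
```

Note that the bandwidth in this sketch depends on the intrinsic dimension $d$ rather than on the ambient dimension $m$, which is the point of the scaling stated above; increasing `m` while keeping the data on the circle leaves the pairwise distances, and hence the statistic, unchanged.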