Kernel two-sample testing provides a powerful framework for distinguishing any pair of distributions based on $n$ sample points. However, existing kernel tests either run in $n^2$ time or sacrifice undue power to improve runtime. To address these shortcomings, we introduce Compress Then Test (CTT), a new framework for high-powered kernel testing based on sample compression. CTT cheaply approximates an expensive test by compressing each $n$ point sample into a small but provably high-fidelity coreset. For standard kernels and subexponential distributions, CTT inherits the statistical behavior of a quadratic-time test -- recovering the same optimal detection boundary -- while running in near-linear time. We couple these advances with cheaper permutation testing, justified by new power analyses; improved time-vs.-quality guarantees for low-rank approximation; and a fast aggregation procedure for identifying especially discriminating kernels. In our experiments with real and simulated data, CTT and its extensions provide 20--200x speed-ups over state-of-the-art approximate MMD tests with no loss of power.
翻译:内核两样测试为区分基于美元抽样点的任何一对分布提供了强有力的框架。 但是,现有的内核测试要么用2美元的时间运行,要么牺牲不当的力量来改善运行时间。 为了解决这些缺陷,我们引入了压缩后测试(CTT)这个基于抽样压缩的高功率内核测试的新框架。 中央技术小组通过将每个点点样本压缩成一个小的但可察觉的高纤维核心体来廉价地估计昂贵的测试。 对于标准的内核和次爆炸性分布,中央技术小组继承了在接近线性时间运行的同时进行二次时间测试 -- -- 恢复同一最佳探测边界 -- -- 的统计行为。我们把这些进展与更廉价的变换测试结合起来,新的权力分析证明这是合理的; 改进了低级近距离的时-五质量保障; 以及一个用于确定特别有区别的内核的快速集成程序。 在我们用真实和模拟数据进行的实验中,中央技术小组及其扩展提供了20-200x速度超过州近似MMD试验。</s>