We introduce kernel thinning, a new procedure for compressing a distribution $\mathbb{P}$ more effectively than i.i.d. sampling or standard thinning. Given a suitable reproducing kernel $\mathbf{k}$ and $\mathcal{O}(n^2)$ time, kernel thinning compresses an $n$-point approximation to $\mathbb{P}$ into a $\sqrt{n}$-point approximation with comparable worst-case integration error in the associated reproducing kernel Hilbert space. With high probability, the maximum discrepancy in integration error is $\mathcal{O}_d(n^{-\frac{1}{2}}\sqrt{\log n})$ for compactly supported $\mathbb{P}$ and $\mathcal{O}_d(n^{-\frac{1}{2}} \sqrt{(\log n)^{d+1}\log\log n})$ for sub-exponential $\mathbb{P}$ on $\mathbb{R}^d$. In contrast, an equal-sized i.i.d. sample from $\mathbb{P}$ suffers $\Omega(n^{-\frac14})$ integration error. Our sub-exponential guarantees resemble the classical quasi-Monte Carlo error rates for uniform $\mathbb{P}$ on $[0,1]^d$ but apply to general distributions on $\mathbb{R}^d$ and a wide range of common kernels. We use our results to derive explicit non-asymptotic maximum mean discrepancy bounds for Gaussian, Mat\'ern, and B-spline kernels and present two vignettes illustrating the practical benefits of kernel thinning over i.i.d. sampling and standard Markov chain Monte Carlo thinning.
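To make the compression step concrete, here is a minimal NumPy sketch of the repeated kernel-halving idea underlying kernel thinning: each round processes the points in consecutive pairs and uses a self-balancing walk to decide which point of each pair to keep, so that the retained half tracks the discarded half in RKHS integration error. This is an illustrative simplification, not the paper's algorithm: it uses a fixed swapping threshold in place of KT-SPLIT's adaptive thresholds, thins with the target kernel directly rather than a square-root kernel, and omits the KT-SWAP refinement; the function names, the Gaussian bandwidth, and the threshold value are all assumptions made for the example.

```python
import numpy as np

def gaussian_kernel(X, Y, bw=1.0):
    """Gaussian kernel matrix: k(x, y) = exp(-||x - y||^2 / (2 bw^2))."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sq / (2.0 * bw**2))

def kernel_halving(K, threshold, rng):
    """One halving round: keep n/2 of the n points whose kernel matrix is K.

    Walks over consecutive pairs (x_i, x_{i+1}) and keeps the running RKHS
    discrepancy psi between kept and discarded halves small by tilting the
    keep/discard coin against the current value of <psi, k(x_i,.) - k(x_{i+1},.)>.
    Fixed-threshold simplification of KT-SPLIT's adaptive thresholds.
    """
    n = K.shape[0]
    s = np.zeros(n)  # s[j] = <psi, k(x_j, .)>, updated in O(n) per pair
    kept = []
    for i in range(0, n - 1, 2):
        alpha = s[i] - s[i + 1]          # <psi, k(x_i,.) - k(x_{i+1},.)>
        p = np.clip(0.5 * (1.0 - alpha / threshold), 0.0, 1.0)
        eta = 1.0 if rng.random() < p else -1.0
        s += eta * (K[:, i] - K[:, i + 1])
        kept.append(i if eta > 0 else i + 1)
    return np.asarray(kept)

# Repeated halving compresses n points down to roughly sqrt(n) points
# in O(n^2) total time (each round costs O(m^2) for m surviving points).
rng = np.random.default_rng(0)
X = rng.standard_normal((1024, 2))       # n = 1024 draws from P = N(0, I_2)
idx = np.arange(len(X))
while len(idx) > int(np.sqrt(len(X))):
    K = gaussian_kernel(X[idx], X[idx])
    idx = idx[kernel_halving(K, threshold=1.0, rng=rng)]
coreset = X[idx]                         # ~32 points approximating P
```

For the algorithm as analyzed in the paper, including the adaptive thresholds and KT-SWAP, see the authors' released implementation rather than this sketch.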