Distribution Compression in Near-linear Time

From arXiv. Accepted to ICLR 2022. An outdated proof of Theorem 2 was previously included in the appendix; this version corrects that oversight.

In distribution compression, one aims to accurately summarize a probability distribution $\mathbb{P}$ using a small number of representative points. Near-optimal thinning procedures achieve this goal by sampling $n$ points from a Markov chain and identifying $\sqrt{n}$ points with $\widetilde{\mathcal{O}}(1/\sqrt{n})$ discrepancy to $\mathbb{P}$. Unfortunately, these algorithms suffer from quadratic or super-quadratic runtime in the sample size $n$. To address this deficiency, we introduce Compress++, a simple meta-procedure for speeding up any thinning algorithm while suffering at most a factor of $4$ in error. When combined with the quadratic-time kernel halving and kernel thinning algorithms of Dwivedi and Mackey (2021), Compress++ delivers $\sqrt{n}$ points with $\mathcal{O}(\sqrt{\log n/n})$ integration error and better-than-Monte-Carlo maximum mean discrepancy in $\mathcal{O}(n \log^3 n)$ time and $\mathcal{O}( \sqrt{n} \log^2 n )$ space. Moreover, Compress++ enjoys the same near-linear runtime given any quadratic-time input and reduces the runtime of super-quadratic algorithms by a square-root factor. In our benchmarks with high-dimensional Monte Carlo samples and Markov chains targeting challenging differential equation posteriors, Compress++ matches or nearly matches the accuracy of its input algorithm in orders of magnitude less time.
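
To make the meta-procedure idea concrete, here is a minimal Python sketch of how a slow halving routine might be wrapped so that it only ever runs on small batches. The abstract does not spell out the recursion, so the structure below (split into four parts, compress each, halve the union, then halve a few more times) is an assumption, as are the names `halve`, `compress`, `compress_pp` and the oversampling parameter `g`. The `halve` stand-in simply keeps every other point; it is not kernel halving and carries none of the error guarantees quoted above.

```python
import random


def halve(points):
    """Stand-in halving step: keep every other point.

    Assumption: a real pipeline would plug in a kernel halving routine here.
    """
    return points[::2]


def compress(points, g, halve_fn=halve):
    """Reduce n = 4^k points (with k >= g) to 2^g * sqrt(n) points."""
    n = len(points)
    if n <= 4 ** g:
        # Base case: 4^g points already equal 2^g * sqrt(4^g).
        return list(points)
    quarter = n // 4
    merged = []
    for i in range(4):
        part = points[i * quarter:(i + 1) * quarter]
        merged.extend(compress(part, g, halve_fn))
    # merged holds 2^(g+1) * sqrt(n) points; one halving leaves 2^g * sqrt(n).
    return halve_fn(merged)


def compress_pp(points, g, halve_fn=halve):
    """Run compress, then halve g more times to reach sqrt(n) points."""
    coreset = compress(points, g, halve_fn)
    for _ in range(g):
        coreset = halve_fn(coreset)
    return coreset


if __name__ == "__main__":
    random.seed(0)
    sample = [random.gauss(0.0, 1.0) for _ in range(4 ** 6)]
    summary = compress_pp(sample, g=2)
    print(len(sample), "->", len(summary))
    print("full-sample mean:", sum(sample) / len(sample))
    print("summary mean:    ", sum(summary) / len(summary))
```

In this sketch the halving routine is never called on more than $2^{g+1}\sqrt{n}$ points at once, so a quadratic-time `halve_fn` would cost $\mathcal{O}(4^g n)$ per recursion level over $\log_4 n$ levels, rather than $\mathcal{O}(n^2)$ on the full sample.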
