Sample- and computationally-efficient distribution estimation is a fundamental tenet in statistics and machine learning. We present SURF, an algorithm for approximating distributions by piecewise polynomials. SURF is: simple, replacing prior complex optimization techniques by straightforward approximation of each potential polynomial piece through simple empirical-probability interpolation, and using plain divide-and-conquer to merge the pieces; universal, as well-known polynomial-approximation results imply that it accurately approximates a large class of common distributions; robust to distribution mis-specification, as for any degree $d \le 8$ it estimates any distribution to an $\ell_1$ distance $< 3$ times that of the nearest degree-$d$ piecewise polynomial, improving known factor upper bounds of 3 for single polynomials and 15 for polynomials with arbitrarily many pieces; and fast, using optimal sample complexity, running in near sample-linear time, and, if given sorted samples, parallelizable to run in sub-linear time. In experiments, SURF outperforms state-of-the-art algorithms.
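To make the phrase "empirical-probability interpolation" concrete, the following is a minimal sketch of one way to fit a single degree-$d$ polynomial piece: split an interval into $d+1$ bins and choose the polynomial whose integral over each bin equals the fraction of samples falling in that bin. This is an illustration under stated assumptions, not the SURF procedure itself: the equal-width bins, the monomial basis, and the names `fit_piece`, `piece_mass`, and `solve_linear` are all hypothetical choices made here, and the divide-and-conquer merging step is not shown.

```python
from bisect import bisect_left


def solve_linear(A, y):
    """Solve A c = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    M = [row[:] + [rhs] for row, rhs in zip(A, y)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    coeffs = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (M[i][n] - s) / M[i][i]
    return coeffs


def monomial_integral(k, lo, hi):
    """Integral of x**k over [lo, hi]."""
    return (hi ** (k + 1) - lo ** (k + 1)) / (k + 1)


def piece_mass(coeffs, lo, hi):
    """Integral over [lo, hi] of the polynomial sum_k coeffs[k] * x**k."""
    return sum(c * monomial_integral(k, lo, hi) for k, c in enumerate(coeffs))


def fit_piece(sorted_samples, a, b, d):
    """Fit a degree-d polynomial on [a, b] that matches empirical bin masses.

    Assumption (not from the source): d + 1 equal-width bins.
    Returns (coeffs in ascending powers, bin edges, empirical masses).
    """
    n = len(sorted_samples)
    edges = [a + (b - a) * j / (d + 1) for j in range(d + 2)]
    # Samples are sorted, so each half-open bin [edges[j], edges[j+1])
    # can be counted with two binary searches.
    idx = [bisect_left(sorted_samples, e) for e in edges]
    masses = [(idx[j + 1] - idx[j]) / n for j in range(d + 1)]
    # Row j constrains the polynomial's integral over bin j.
    A = [[monomial_integral(k, edges[j], edges[j + 1]) for k in range(d + 1)]
         for j in range(d + 1)]
    return solve_linear(A, masses), edges, masses


if __name__ == "__main__":
    samples = [(i + 0.5) / 1000 for i in range(1000)]  # evenly spread on [0, 1]
    coeffs, edges, masses = fit_piece(samples, 0.0, 1.0, 2)
    for j, m in enumerate(masses):
        print(j, m, piece_mass(coeffs, edges[j], edges[j + 1]))
```

Because the samples are sorted, each bin count costs only a binary search, which gives some intuition for why sorted input helps with running time. The monomial basis is used only for brevity; it becomes ill-conditioned for larger degrees or intervals far from the origin, where a rescaled or orthogonal basis would be a safer choice.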