An $\varepsilon$-approximate quantile sketch over a stream of $n$ inputs approximates the rank of any query point $q$ (that is, the number of input points less than $q$) up to an additive error of $\varepsilon n$, typically with probability at least $1 - 1/\mathrm{poly}(n)$, while consuming $o(n)$ space. While the celebrated KLL sketch of Karnin, Lang, and Liberty is provably optimal over worst-case streams, the approximations it achieves in practice are often far from the best attainable. Indeed, the most commonly used technique in practice is Dunning's t-digest, which often achieves much better approximations than KLL on real-world data but is known to suffer arbitrarily large errors in the worst case. We apply interpolation techniques to the streaming quantiles problem, aiming to achieve better approximations than KLL on real-world data sets while maintaining similar worst-case guarantees.
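To make the guarantee concrete: a sketch's estimate $\widehat{\mathrm{rank}}(q)$ must satisfy $|\widehat{\mathrm{rank}}(q) - \mathrm{rank}(q)| \le \varepsilon n$. The following Python sketch is a hypothetical baseline, not KLL or t-digest: it estimates rank from a uniform reservoir sample, which meets an $\varepsilon n$ error bound with high probability only when the sample size is on the order of $1/\varepsilon^2$.

```python
import random

def true_rank(xs, q):
    # Exact rank: number of stream elements strictly less than q.
    return sum(1 for x in xs if x < q)

def sample_sketch(stream, k, seed=0):
    # Reservoir-sample k points uniformly from the stream.
    # (Illustrative baseline only; far weaker than KLL's compactor scheme.)
    rng = random.Random(seed)
    sample = []
    n = 0
    for x in stream:
        n += 1
        if len(sample) < k:
            sample.append(x)
        else:
            j = rng.randrange(n)
            if j < k:
                sample[j] = x
    return sample, n

def est_rank(sketch, q):
    # Scale the sample rank back up to an estimate of the stream rank.
    sample, n = sketch
    below = sum(1 for x in sample if x < q)
    return below * n / len(sample)

stream = list(range(10000))
random.Random(1).shuffle(stream)
sk = sample_sketch(stream, k=2000)
q = 2500
err = abs(est_rank(sk, q) - true_rank(stream, q))
# For eps = 0.05 and n = 10000, the target is err <= eps * n = 500;
# a sample of k = 2000 achieves this with overwhelming probability.
```

KLL improves on this baseline by replacing uniform sampling with a hierarchy of compactors, bringing the space for a fixed $\varepsilon$ down from roughly $1/\varepsilon^2$ to roughly $1/\varepsilon$ (up to logarithmic factors).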