Estimating the distribution and quantiles of data is a foundational task in data mining and data science. We study algorithms which provide accurate results for extreme quantile queries using a small amount of space, thus helping to understand the tails of the input distribution. Namely, we focus on two recent state-of-the-art solutions: $t$-digest and ReqSketch. While $t$-digest is a popular compact summary which works well in a variety of settings, ReqSketch comes with formal accuracy guarantees at the cost of its size growing as new observations are inserted. In this work, we provide insight into which conditions make one preferable to the other. Namely, we show how to construct inputs for $t$-digest that induce an almost arbitrarily large error and demonstrate that it fails to provide accurate results even on i.i.d. samples from a highly non-uniform distribution. We propose practical improvements to ReqSketch, making it faster than $t$-digest, while its error stays bounded on any instance. Still, our results confirm that $t$-digest remains more accurate on the ``non-adversarial'' data encountered in practice.
翻译:估算数据的分布和量化是数据挖掘和数据科学的一项基本任务。 我们研究利用少量空间为极端量化查询提供准确结果的算法, 从而帮助理解输入分布的尾部。 也就是说, 我们侧重于两个最新最先进的解决方案: $t$- digest 和 ReqSketch 。 虽然美元- digest 是一个在各种环境下效果很好的流行缩略语, ReqSketch 却以其大小不断增大的成本获得正式的准确性保证。 在这项工作中, 我们提供了对哪些条件使一个条件更适合另一个条件的洞察力。 也就是说, 我们展示了如何为最小的 $- digest 建构出一个几乎是任意大错误的投入, 并表明它甚至无法提供高度非统一分布的样本的准确结果。 我们建议对 Reqketch 进行实际的改进, 使其速度超过$t- dgetest, 但它的误差将一直局限在任何实例上。 然而, 我们的结果证实, $- dest- digest practality cload on the distialtial