UDDSKETCH is a recent algorithm for accurate tracking of quantiles in data streams, derived from the DDSKETCH algorithm. UDDSKETCH provides accuracy guarantees covering the full range of quantiles independently of the input distribution and greatly improves the accuracy with regard to DDSKETCH. In this paper we show how to compress and fuse data streams (or datasets) by using UDDSKETCH data summaries that are fused into a new summary related to the union of the streams (or datasets) processed by the input summaries whilst preserving both the error and size guarantees provided by UDDSKETCH. This property of sketches, known as mergeability, enables parallel and distributed processing. We prove that UDDSKETCH is fully mergeable and introduce a parallel version of UDDSKETCH suitable for message-passing based architectures. We formally prove its correctness and compare it to a parallel version of DDSKETCH, showing through extensive experimental results that our parallel algorithm almost always outperforms the parallel DDSKETCH algorithm with regard to the overall accuracy in determining the quantiles.
翻译:UDDSKETCH 是一种根据 DDSKETCH 算法对数据流中的孔径进行准确跟踪的最新算法。 UDSKETCH 提供准确性保障,覆盖所有孔径,独立于输入分布,并大大提高DDSKETCH的准确性。在本文中,我们展示了如何通过使用 UDDSKETCH 数据摘要压缩和整合数据流(或数据集)的方法,这些数据摘要被纳入输入摘要所处理的流流(或数据集)组合的新摘要中,同时保留UDSKETCH 所提供的错误和大小保证。这种素描的属性被称为合并性,能够实现平行和分布处理。我们证明UDDSKETCH 完全可以合并,并引入了适合以信息接收为基础的结构的平行版本UDDDSKETCH。我们正式证明其正确性,并将它与DSKETCH的平行版本进行比较,通过广泛的实验结果显示,我们的平行算法几乎都比平行的DSKETCH 算法在确定质量方面的总体精确性。