Estimating the number of distinct values (NDV) is a fundamental problem in data mining with many applications. Existing NDV estimation methods fall into two broad categories: i) scanning-based methods, which scan the entire dataset and maintain a sketch to approximate NDV; and ii) sampling-based methods, which estimate NDV from a sample rather than accessing the entire data warehouse. Scanning-based methods achieve lower approximation error at the cost of higher I/O and longer running time. Sampling-based estimation scales better and is therefore preferable when the data volume is large and some error is tolerable. However, although sampling-based methods work well on a single machine, they are less practical in a distributed environment with massive data volumes: to compute the final NDV estimate, the entire sample must be transferred across the distributed system, which incurs a prohibitive communication cost when the sampling rate is high. This paper proposes a novel sketch-based distributed method that achieves sub-linear communication cost for distributed sampling-based NDV estimation under mild assumptions. Our method uses a sketch-based algorithm to estimate the sample's {\em frequency of frequency} in the {\em distributed streaming model}, which makes it compatible with most classical sampling-based NDV estimators. We also provide theoretical evidence that our method minimizes communication cost in the worst case. Extensive experiments show that our method reduces communication cost by orders of magnitude compared with existing sampling- and sketch-based methods.
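To make the role of the frequency of frequency concrete, the following is a minimal sketch of the naive centralized baseline, not the sketch-based method proposed in the paper. It assumes each machine ships its full per-value sample counts to a coordinator, which merges them, builds the frequency of frequency, and feeds it to two classical sampling-based estimators (GEE and Chao). All function names, the toy samples, and the population size are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def merge_local_counts(local_samples):
    """Merge per-machine samples into global per-value counts.

    This is the communication-heavy step: every machine sends its
    whole sample (or its per-value counts) to the coordinator.
    """
    merged = Counter()
    for sample in local_samples:
        merged.update(sample)
    return merged


def frequency_of_frequency(value_counts):
    """Return f, where f[j] is the number of distinct values that
    occur exactly j times in the sample."""
    return Counter(value_counts.values())


def gee_estimate(fof, population_size, sample_size):
    """GEE estimator: sqrt(N / n) * f_1 + sum of f_j for j >= 2."""
    f1 = fof.get(1, 0)
    repeated = sum(count for j, count in fof.items() if j >= 2)
    return sqrt(population_size / sample_size) * f1 + repeated


def chao_estimate(fof):
    """Chao estimator: d + f_1^2 / (2 * f_2), with d distinct sampled values.

    Assumption for this sketch: fall back to d when f_2 is zero.
    """
    d = sum(fof.values())
    f1 = fof.get(1, 0)
    f2 = fof.get(2, 0)
    if f2 == 0:
        return float(d)
    return d + (f1 * f1) / (2 * f2)


# Hypothetical toy data: samples drawn on two machines.
machine_a = ["a", "b", "a", "c"]
machine_b = ["d", "d", "d", "e"]
population_size = 800  # assumed total number of rows

counts = merge_local_counts([machine_a, machine_b])
sample_size = sum(counts.values())
fof = frequency_of_frequency(counts)

gee = gee_estimate(fof, population_size, sample_size)
chao = chao_estimate(fof)
print(dict(fof), gee, chao)
```

Both estimators read only the frequency of frequency, so any procedure that approximates it well enough can stand in for the merge step. The merge step is what costs communication here: the same value may be sampled on several machines, so local frequency-of-frequency vectors cannot simply be added together.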