Conservative Count-Min, an improved version of Count-Min sketch [Cormode, Muthukrishnan 2005], is an online-maintained hashing-based data structure summarizing element frequency information without storing elements themselves. Although several works attempted to analyze the error that can be made by Count-Min, the behavior of this data structure remains poorly understood. In [Fusy, Kucherov 2022], we demonstrated that under the uniform distribution of input elements, the error of conservative Count-Min follows two distinct regimes depending on its load factor. In this work, we provide a series of experimental results providing new insights into the behavior of conservative Count-Min. Our contributions can be seen as twofold. On one hand, we provide a detailed experimental analysis of the behavior of Count-Min sketch in different regimes and under several representative probability distributions of input elements. On the other hand, we demonstrate improvements that can be made by assigning a variable number of hash functions to different elements. This includes, in particular, reduced space of the data structure while still supporting a small error.
翻译:保守的 Countical Control-Min 改进版的 Count-Min 草图[Cormode, Muthukrishnan 2005] 是一个在线维护的散列数据结构,在不存储元素本身的情况下总结元素频率信息。虽然有好几项工作试图分析伯爵-Min 能够作出的错误,但这一数据结构的行为仍然不甚为人理解。在[Fusy, Kucherov 2022] 中,我们证明,根据输入元素的统一分配,保守的 Count-Min 的错误遵循两个不同的制度。在这项工作中,我们提供了一系列实验结果,对保守的 Count-Min 的行为提供了新的洞察力。我们的贡献可以看作是双重的。一方面,我们提供了对 Count-Min 草图在不同制度下的行为的详细实验性分析,在几种有代表性的输入元素的概率分布下,我们展示了可以通过在不同元素中指定一个可变数的集质功能来改进。这特别包括减少数据结构的空间,同时仍然支持一个小错误。