We discuss the problem of counting distinct elements in a stream. A stream is usually considered as a sequence of elements that come one at a time. An exact solution to the problem requires memory space of the size of the stream. For many applications this solution is infeasible due to very large streams. The solution in that case, is to use a probabilistic data structure (also called sketch), from which we can estimate with high accuracy the cardinality of the stream. We present a new algorithm, ExtendedHyperLogLog (EHLL), which is based on the state-of-the-art algorithm, HyperLogLog (HLL). In order to achieve the same accuracy as HLL, EHLL uses 16% less memory. In recent years, a martingale approach has bean developed. In the martingale setting we receive better accuracy at the price of not being able to merge sketches. EHLL also works in the martingale setting. Martingale EHLL achieves the same accuracy as Martingale HLL using 12% less memory.
翻译:我们讨论在流中计算不同元素的问题。 流通常被视为一个元素序列, 一次产生一个元素。 问题的精确解决方案需要流体大小的记忆空间。 对于许多应用来说, 这个解决方案是无法做到的, 因为流体非常大。 在这种情况下, 解决办法是使用概率性数据结构( 也称为草图), 我们可以从中非常精确地估计流体的基点。 我们提出了一个新的算法, 扩展 HyperLogLog (ELL), 其基础是最新算法, 超LogLog (HLL) 。 为了实现与 HLL 相同的精确度, ELL 使用比 16 % 的记忆。 最近几年, martingale 方法已经形成。 在 martingale 设置中, 我们以无法合并草图的价格得到更好的精度。 ELL 还在 martingale 设置中工作 。 Martingale EHLLL 实现与 Martingale 低12% 的记忆一样的精度 。