The amount of data coming from sources such as IoT sensors, social networks, and cellular networks has increased exponentially over the last few years. Probabilistic Data Structures (PDS) are efficient alternatives to deterministic data structures and are well suited to large-scale data processing and streaming applications. They are mainly used for approximate membership queries, frequency counting, cardinality estimation, and similarity search. Finding the number of distinct elements in a large dataset or in streaming data is an active research area. In this work, we show that the usual methods based on Bloom filters for this kind of cardinality estimation are relatively accurate on average but have a high variance. Reducing this variance is therefore valuable for obtaining accurate statistics. We propose a probabilistic approach that estimates the cardinality of a Bloom filter more accurately from its parameters, i.e., the number of hash functions $k$, the size $m$, and a counter $s$ that is incremented whenever an element is not in the filter (i.e., when the membership query for this element returns a negative result). Because of the Bloom filter's nature, the value of the counter can never exceed the exact cardinality, but hash collisions can cause it to underestimate that cardinality. This creates a counting error that we estimate accurately, in-stream, along with its standard deviation. We also discuss a way to optimize the parameters of a Bloom filter based on its counting error. We evaluate our approach with synthetic data created from an analysis of a real mobility dataset provided by a mobile network operator in the form of displacement matrices computed from mobile phone records. The proposed approach performs at least as well on average as state-of-the-art methods and has a much lower variance (about 6 to 7 times lower).
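To make the role of the counter $s$ concrete, the following is a minimal Python sketch of a Bloom filter with $m$ bits and $k$ hash functions that increments $s$ whenever a membership query is negative. The class name `CountingBloom`, the SHA-256 double-hashing scheme, and the byte-per-bit storage are assumptions made for illustration; none of them come from this work. The `fill_estimate` method implements the standard fill-ratio estimate $-(m/k)\ln(1 - X/m)$, where $X$ is the number of set bits; it stands in for the usual Bloom-filter-based methods and is not the estimator proposed here.

```python
import hashlib
import math


class CountingBloom:
    """Bloom filter that also maintains the counter s (illustrative sketch)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m                 # filter size in bits
        self.k = k                 # number of hash functions
        self.bits = bytearray(m)   # one byte per bit, chosen for readability
        self.s = 0                 # incremented on each negative membership query

    def _positions(self, item: str) -> list[int]:
        # Assumption: derive k positions by double hashing a SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> bool:
        """Insert item; return True if the filter reported it as unseen."""
        positions = self._positions(item)
        is_new = not all(self.bits[p] for p in positions)
        if is_new:
            # Negative query: the element is certainly new, so count it.
            self.s += 1
            for p in positions:
                self.bits[p] = 1
        # A false positive on a genuinely new element leaves s unchanged,
        # which is why s can undercount but never overcount.
        return is_new

    def fill_estimate(self) -> float:
        """Standard fill-ratio cardinality estimate (not this work's method)."""
        set_bits = sum(self.bits)
        if set_bits >= self.m:
            return math.inf        # a saturated filter carries no information
        return -(self.m / self.k) * math.log(1 - set_bits / self.m)


if __name__ == "__main__":
    bf = CountingBloom(m=20_000, k=4)
    for i in range(1_000):
        bf.add(f"user-{i}")
        bf.add(f"user-{i}")        # duplicates never increment s
    print("counter s:", bf.s)
    print("fill-ratio estimate:", round(bf.fill_estimate(), 1))
```

In this sketch, duplicates are always reported as present, so only a false positive on a new element can make `s` fall short of the true cardinality; that shortfall is the counting error the abstract refers to.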