Weighted Bloom filters (Bruck, Gao and Jiang, ISIT 2006) are Bloom filters that adapt the number of hash functions according to the query element. That is, they use a sequence of hash functions $h_1, h_2, \dots$ and insert $x$ by setting the bits at the $k_x$ positions $h_1(x), h_2(x), \dots, h_{k_x}(x)$ to 1, where the parameter $k_x$ depends on $x$. Similarly, a query for $x$ checks whether the bits at positions $h_1(x), h_2(x), \dots, h_{k_x}(x)$ contain a $0$ (in which case we know that $x$ was not inserted) or only $1$s (in which case $x$ may have been inserted, but it could also be a false positive). In this paper, we determine a near-optimal choice of the parameters $k_x$ in a model where $n$ elements are inserted independently from a probability distribution $\mathcal{P}$ and query elements are chosen from a probability distribution $\mathcal{Q}$, under a bound on the false positive probability $F$. In contrast, the parameter choice of Bruck et al., as well as follow-up work by Wang et al., does not guarantee a nontrivial bound on the false positive rate. We refer to our parameterization of the weighted Bloom filter as a $\textit{Daisy Bloom filter}$. For many distributions $\mathcal{P}$ and $\mathcal{Q}$, the space usage of the Daisy Bloom filter is significantly smaller than that of standard Bloom filters. Our upper bound is complemented by an information-theoretic lower bound, showing that (with mild restrictions on the distributions $\mathcal{P}$ and $\mathcal{Q}$) the space usage of Daisy Bloom filters is the best possible up to a constant factor. Daisy Bloom filters can be seen as a fine-grained variant of a recent data structure of Vaidya, Knorr, Mitzenmacher and Kraska. Like their work, we are motivated by settings in which we have prior knowledge of the workload of the filter, possibly in the form of advice from a machine learning algorithm.
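The insert and query operations described above can be illustrated with a minimal Python sketch. It is not the Daisy Bloom filter parameterization itself: the rule mapping $x$ to $k_x$ is left to the caller as a hypothetical function `k_of`, and the hash functions $h_i$ are simulated by hashing the pair $(i, x)$ with BLAKE2b, which is an assumption made only for illustration. The class name `WeightedBloomFilter` and the parameter `m` (number of bits) are likewise names introduced here.

```python
import hashlib
from typing import Callable


class WeightedBloomFilter:
    """Bloom filter whose number of probes k_x depends on the element x."""

    def __init__(self, m: int, k_of: Callable[[str], int]) -> None:
        self.m = m                # number of bits in the filter (assumed name)
        self.k_of = k_of          # hypothetical caller-supplied rule x -> k_x
        self.bits = bytearray(m)  # one byte per bit, chosen for readability

    def _position(self, i: int, x: str) -> int:
        # Stand-in for h_i(x): hash the pair (i, x) and reduce modulo m.
        data = f"{i}:{x}".encode("utf-8")
        digest = hashlib.blake2b(data, digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.m

    def insert(self, x: str) -> None:
        # Set the bits at positions h_1(x), ..., h_{k_x}(x) to 1.
        for i in range(1, self.k_of(x) + 1):
            self.bits[self._position(i, x)] = 1

    def query(self, x: str) -> bool:
        # False: some probed bit is 0, so x was definitely not inserted.
        # True: all probed bits are 1, so x may have been inserted
        # (or this is a false positive).
        return all(
            self.bits[self._position(i, x)]
            for i in range(1, self.k_of(x) + 1)
        )


# Illustrative rule only: fewer probes for keys marked "rare".
f = WeightedBloomFilter(1024, lambda x: 2 if x.startswith("rare") else 5)
f.insert("rare-1")
f.insert("common-1")
```

Because a query probes exactly the positions that the corresponding insert set, an inserted element is never reported as absent, whatever rule is used for $k_x$. The choice of $k_x$ only affects the false positive probability and the space needed to meet the bound $F$.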