Learned Bloom Filters, i.e., models induced from data via machine learning techniques that solve the approximate set membership problem, have recently been introduced to improve on the performance of standard Bloom Filters, with a particular focus on space occupancy. Unlike in the classical case, the "complexity" of the data used to build the filter can heavily affect its performance. We therefore propose what is, to the best of our knowledge, the first in-depth analysis of the performance of a given Learned Bloom Filter, in conjunction with a given classifier, on a dataset of a given classification complexity. Specifically, we propose a novel methodology, supported by software, for designing, analyzing and implementing Learned Bloom Filters as a function of specific constraints on their multi-criteria nature (that is, constraints involving space efficiency, false positive rate, and reject time). Our experiments show that the proposed methodology and the supporting software are valid and useful: we find that only two classifiers have desirable properties across problems of different data complexity and, interestingly, neither of them has been considered so far in the literature. We also show experimentally that the Sandwiched variant of Learned Bloom Filters is the most robust to data complexity and to variability in classifier performance, and that it usually has smaller reject times. The software can be readily used to test new Learned Bloom Filter proposals, which can then be compared with the best ones identified here.
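To make the structure discussed in the abstract concrete, the following is a minimal sketch of a Sandwiched Learned Bloom Filter: an initial Bloom filter, then a classifier with a score threshold, then a backup Bloom filter holding the keys the classifier would miss. It is not the software accompanying the paper. The class names, the `score` callable, the `threshold` and the two target false positive rates are all hypothetical, and `toy_score` is only a stand-in for a trained classifier.

```python
import hashlib
import math


class BloomFilter:
    """Classical Bloom filter over strings, sized from n and a target FPR."""

    def __init__(self, n_items, fpr):
        n_items = max(1, n_items)
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.m = max(8, math.ceil(-n_items * math.log(fpr) / math.log(2) ** 2))
        self.k = max(1, round(self.m / n_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, key):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(key))


class SandwichedLearnedBloomFilter:
    """Initial filter -> classifier -> backup filter (hypothetical sketch)."""

    def __init__(self, keys, score, threshold, initial_fpr=0.2, backup_fpr=0.05):
        keys = list(keys)
        self.score = score          # assumed: callable mapping a key to a score
        self.threshold = threshold  # assumed: scores >= threshold mean "member"
        self.initial = BloomFilter(len(keys), initial_fpr)
        for key in keys:
            self.initial.add(key)
        # Keys the classifier would reject go to the backup filter,
        # so the overall structure has no false negatives.
        missed = [key for key in keys if score(key) < threshold]
        self.backup = BloomFilter(len(missed), backup_fpr)
        for key in missed:
            self.backup.add(key)

    def __contains__(self, key):
        if key not in self.initial:      # cheap early reject
            return False
        if self.score(key) >= self.threshold:
            return True                  # classifier accepts (may be a false positive)
        return key in self.backup        # catch the classifier's false negatives


def toy_score(key):
    # Stand-in for a trained classifier: "accepts" keys ending in an even digit.
    return 1.0 if key[-1] in "02468" else 0.0


keys = [f"key{i}" for i in range(200)]
slbf = SandwichedLearnedBloomFilter(keys, toy_score, threshold=0.5)
print(all(key in slbf for key in keys))  # every inserted key is reported as present
```

In this sketch a non-key is rejected as soon as the initial filter misses, which is one plausible reason a sandwiched layout can reduce reject time; the space, false positive rate and reject time trade-offs depend on the classifier and on how the two filters are sized.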