A Monotone Minimal Perfect Hash Function (MMPHF) constructed on a set S of keys is a function that maps each key in S to its rank. On keys not in S, the function returns an arbitrary value. Applications range from databases, search engines, data encryption, to pattern-matching algorithms. In this paper, we describe LeMonHash, a new technique for constructing MMPHFs for integers. The core idea of LeMonHash is surprisingly simple and effective: we learn a monotone mapping from keys to their rank via an error-bounded piecewise linear model (the PGM-index), and then we solve the collisions that might arise among keys mapping to the same rank estimate by associating small integers with them in a retrieval data structure (BuRR). On synthetic random datasets, LeMonHash needs 35% less space than the next best competitor, while achieving about 16 times faster queries. On real-world datasets, the space usage is very close to or much better than the best competitors, while achieving up to 19 times faster queries than the next larger competitor. As far as the construction of LeMonHash is concerned, we get an improvement by a factor of up to 2, compared to the competitor with the next best space usage. We also investigate the case of keys being variable-length strings, introducing the so-called LeMonHash-VL: it needs space within 10% of the best competitors while achieving up to 3 times faster queries.
翻译:学习的单调最小完美哈希
一个在键值集合S上构建的单调最小完美哈希函数(MMPHF)是一个将每个键映射到其排名的函数。对于S中不存在的键,函数返回任意值。应用范围包括数据库、搜索引擎、数据加密和模式匹配算法。在本文中,我们介绍了一种新技术LeMonHash,用于构建整数的MMPH。LeMonHash的核心思想非常简单而有效:我们通过一个误差有界的分段线性模型(PGM索引)学习了一个从键到它们的排名的单调映射,然后我们在检索数据结构(BuRR)中将映射到同一排名的键赋予小整数,以解决可能产生的冲突。在合成随机数据集上,LeMonHash所需的空间比下一个最佳竞争者少35%,同时查询速度快约16倍。在真实数据集上,空间使用与最佳竞争者非常接近或更好,同时查询速度比下一个更大的竞争者快达19倍。就LeMonHash的构建而言,在空间使用方面,我们获得了比下一个最佳空间使用竞争者高达2倍的改进。我们还研究了键值为可变长度字符串的情况,引入了所谓的LeMonHash-VL:它所需空间不到最佳竞争者的10%,同时查询速度快达3倍。