Locality-Preserving Minimal Perfect Hashing of k-mers

Minimal perfect hashing is the problem of mapping a static set of $n$ distinct keys into the address space $\{1,\ldots,n\}$ bijectively. It is well known that $n\log_2 e$ bits are necessary to specify a minimal perfect hash function (MPHF) $f$ when no additional knowledge of the input keys is used. In practice, however, the input keys often have intrinsic relationships that we can exploit to lower the bit complexity of $f$. For example, consider a string and the set of all its distinct substrings of length $k$, the so-called $k$-mers of the string. Two consecutive $k$-mers in the string are strongly related in that they share an overlap of $k-1$ symbols. Hence, it seems intuitively possible to beat the classic $\log_2 e$ bits/key barrier in this case. Moreover, we would like $f$ to map consecutive $k$-mers to consecutive addresses, so as to preserve the relationships between the keys as much as possible in the co-domain $\{1,\ldots,n\}$ as well. This is a useful feature in practice because it guarantees a certain degree of locality of reference for $f$, resulting in a better evaluation time when querying consecutive $k$-mers from a string. Motivated by these premises, we initiate the study of a new type of locality-preserving minimal perfect hash function designed for $k$-mers extracted consecutively from a string (or a collection of strings). We show a theoretical lower bound on the bit complexity of any $(1-\varepsilon)$-locality-preserving MPHF, for a parameter $0 < \varepsilon < 1$. This bound is lower than $n\log_2 e$ bits for sufficiently small $\varepsilon$. We propose a construction that approaches the theoretical minimum space for growing $k$ and present a practical implementation of the method.
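To make the notions of $k$-mer overlap and locality concrete, the following is a minimal illustrative sketch, not the construction proposed in this work. It extracts the $k$-mers of a string, builds a naive minimal perfect hash by numbering distinct $k$-mers in order of first occurrence, and measures the fraction of consecutive $k$-mer pairs that land on consecutive addresses. The function names (`kmers`, `build_first_occurrence_mphf`, `consecutive_fraction`) and the choice of this fraction as a locality measure are assumptions made for illustration. The table stores every key explicitly, so it says nothing about the bit complexity discussed above.

```python
def kmers(s, k):
    """Return the k-mers of s in order of appearance (repeats included)."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def build_first_occurrence_mphf(s, k):
    """Hypothetical helper: map each distinct k-mer to an address in 1..n.

    Addresses are assigned in order of first occurrence, so the mapping is
    a bijection onto {1, ..., n}, where n is the number of distinct k-mers.
    This is a plain lookup table, not a space-efficient MPHF.
    """
    table = {}
    for x in kmers(s, k):
        if x not in table:
            table[x] = len(table) + 1
    return table


def consecutive_fraction(s, k, f):
    """Fraction of adjacent k-mer pairs (x, y) in s with f[y] == f[x] + 1.

    Assumption: this is used here only as an informal proxy for locality.
    """
    xs = kmers(s, k)
    pairs = list(zip(xs, xs[1:]))
    if not pairs:
        return 1.0
    hits = sum(1 for x, y in pairs if f[y] == f[x] + 1)
    return hits / len(pairs)


# Small demonstration on a toy string.
s, k = "ACGTACGGA", 3
f = build_first_occurrence_mphf(s, k)
print(f)
print(consecutive_fraction(s, k, f))
```

In the toy string, the $k$-mer `ACG` occurs twice, so the pairs around its second occurrence cannot both map to consecutive addresses. Repeated $k$-mers of this kind are what prevent a function from preserving locality for every consecutive pair.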
