First introduced in 1954, linear probing is one of the oldest data structures in computer science, and due to its unrivaled data locality, it continues to be one of the fastest hash tables in practice. It is widely believed and taught, however, that linear probing should never be used at high load factors; this is because primary-clustering effects cause insertions at load factor $1 - 1 /x$ to take expected time $\Theta(x^2)$ (rather than the ideal $\Theta(x)$). The dangers of primary clustering, first discovered by Knuth in 1963, have been taught to generations of computer scientists, and have influenced the design of some of many widely used hash tables. We show that primary clustering is not a foregone conclusion. We demonstrate that small design decisions in how deletions are implemented have dramatic effects on the asymptotic performance of insertions, so that, even if a hash table operates continuously at a load factor $1 - \Theta(1/x)$, the expected amortized cost per operation is $\tilde{O}(x)$. This is because tombstones created by deletions actually cause an anti-clustering effect that combats primary clustering. We also present a new variant of linear probing (which we call graveyard hashing) that completely eliminates primary clustering on \emph{any} sequence of operations: if, when an operation is performed, the current load factor is $1 - 1/x$ for some $x$, then the expected cost of the operation is $O(x)$. One corollary is that, in the external-memory model with a data blocks of size $B$, graveyard hashing offers the following remarkable guarantee: at any load factor $1 - 1/x$ satisfying $x = o(B)$, graveyard hashing achieves $1 + o(1)$ expected block transfers per operation. Past external-memory hash tables have only been able to offer a $1 + o(1)$ guarantee when the block size $B$ is at least $\Omega(x^2)$.
翻译:1954年首次推出, 线性检测是计算机科学中最古老的数据结构之一, 并且由于它没有价值的数据所在地点, 它仍然是实际中最快的散列表格之一。 但是, 它被广泛相信和教导, 线性检测绝不应用于高负荷系数; 这是因为初级集束效应导致在负载系数1 - 1 /x美元上插入负载系数1 - 1x2美元( 而不是理想的 $Theta( x)美元 美元 ) 。 由于 Knuth 1963 首次发现的基本集成的危险, 已经教给计算机科学家的代代名 $ 。 并且已经影响了许多广泛使用的散列表的操作设计 。 我们显示, 原始集不应在高负荷系数1 - 1 / 1 / x /x 的负载效果上插入, 因此即使一个有色的表在负载系数1 - Theta (1/x) 美元 美元 美元 。 当时, 预言的每次集成成本成本是 $ = 美元 = 美元 正在运行中, 我们的直径 直径 的磁质数据 。