We consider the task of constructing a data structure for associating a static set of keys with values, while allowing arbitrary output values for queries involving keys outside the set. Compared to hash tables, these so-called static function data structures do not need to store the key set and thus use significantly less memory. Several techniques are known, with compressed static functions approaching the zero-order empirical entropy of the value sequence. In this paper, we introduce learned static functions, which use machine learning to capture correlations between keys and values. For each key, a model predicts a probability distribution over the values, from which we derive a key-specific prefix code to compactly encode the true value. The resulting codeword is stored in a classic static function data structure. This design allows learned static functions to break the zero-order entropy barrier while still supporting point queries. Our experiments show substantial space savings: up to one order of magnitude on real data, and up to three orders of magnitude on synthetic data.
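To make the pipeline concrete, the following is a minimal Python sketch of the idea under stated assumptions: a toy model predicts a per-key distribution over values, a Huffman code derived from that prediction encodes the true value, and the resulting codeword is stored. The names `predict`, `build`, and `query` and the parity-based toy model are illustrative assumptions, not the paper's API, and a plain dict stands in for the classic static function data structure (which, unlike the dict, would not store the keys themselves).

```python
import heapq

def predict(key):
    """Hypothetical stand-in for a learned model: returns a per-key
    probability distribution over values. Here the prediction depends
    only on the key's parity."""
    if key % 2 == 0:
        return {"A": 0.7, "B": 0.2, "C": 0.1}
    return {"A": 0.1, "B": 0.3, "C": 0.6}

def huffman_code(probs):
    """Derive a prefix (Huffman) code from a distribution {value: prob}.
    Ties are broken deterministically so that the same code can be
    rebuilt at query time. Assumes at least two values."""
    heap = [(p, i, {v: ""}) for i, (v, p) in enumerate(sorted(probs.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)
        p1, _, c1 = heapq.heappop(heap)
        merged = {v: "0" + w for v, w in c0.items()}
        merged.update({v: "1" + w for v, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

def build(pairs):
    """Encode each true value with its key-specific prefix code and store
    the codeword. The dict is a placeholder for a static function /
    retrieval structure, which would not store the keys."""
    return {key: huffman_code(predict(key))[value] for key, value in pairs}

def query(store, key):
    """Rebuild the key's code from the (deterministic) model prediction
    and invert it to decode the stored codeword."""
    decode = {w: v for v, w in huffman_code(predict(key)).items()}
    return decode[store[key]]

data = [(0, "A"), (1, "C"), (2, "A"), (3, "B"), (4, "B")]
store = build(data)
assert all(query(store, k) == v for k, v in data)
# Well-predicted values get short codewords, e.g. "A" costs 1 bit for even keys.
print({k: store[k] for k, _ in data})
```

Because the model is deterministic given the key, the same prefix code is rebuilt at query time, so only the codeword needs to be stored; values the model predicts well receive short codewords, which is how the scheme can undercut the zero-order entropy of the value sequence.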