We design the first learned index that solves the dictionary problem with time and space complexity provably better than both classic data structures for hierarchical memories, such as B-trees, and modern learned indexes. We call our solution the Piecewise Geometric Model index (PGM-index) because it turns the indexing of a sequence of keys into the coverage of a sequence of 2D points via linear models (i.e. segments) suitably learned to trade query time against space efficiency. This idea builds on known heuristic results, which we strengthen by showing that the minimal number of such segments can be computed via known, optimal streaming algorithms. Our index is then obtained by applying this geometric idea recursively, which guarantees a smoothed adaptation to the "geometric complexity" of the input data. Finally, we propose a variant of the index that adapts not only to the distribution of the dictionary keys but also to their access frequencies, thus obtaining the first distribution-aware learned index. The second main contribution of this paper is the proposal and study of the concept of Multicriteria Data Structure, namely a data structure that adapts automatically to the constraints imposed by the application that uses it. We show that our index is a multicriteria data structure because its significant flexibility in storage and query time can be exploited by a properly designed optimisation algorithm that efficiently finds the design setting that best matches the input constraints. A thorough experimental analysis shows that our index and its multicriteria variant improve uniformly on classic and learned indexes, in both time and space, by up to several orders of magnitude.
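To make the geometric idea concrete, the following is a minimal single-level sketch, not the construction used in the paper. It treats each sorted key as the 2D point (key, position) and covers the points with segments whose predicted position is within an integer error `eps` of the true one. The segmentation here is a simple greedy "shrinking cone" pass, which is a swapped-in heuristic and does not guarantee the minimal number of segments that the optimal streaming algorithm achieves; the recursive levels and the distribution-aware variant are omitted. The names `build_segments`, `PLAIndex`, and `lookup` are hypothetical, and the keys are assumed to be sorted, distinct numbers.

```python
from bisect import bisect_left, bisect_right


def _pick_slope(lo, hi):
    # Any slope in [lo, hi] respects the error bound; take the midpoint,
    # or lo when the cone is still unbounded (single-point segment).
    return lo if hi == float("inf") else (lo + hi) / 2


def build_segments(keys, eps):
    """Greedy shrinking-cone pass (a heuristic, NOT the optimal algorithm).

    Returns a list of (first_key, slope, first_pos) tuples such that, for
    every key at position i in a segment,
    |first_pos + slope * (key - first_key) - i| <= eps (up to float rounding).
    """
    segments = []
    k0, p0 = keys[0], 0              # origin of the current segment
    lo, hi = 0.0, float("inf")       # feasible slope range (the "cone")
    for pos in range(1, len(keys)):
        dx = keys[pos] - k0          # > 0 because keys are sorted and distinct
        new_lo = max(lo, (pos - p0 - eps) / dx)
        new_hi = min(hi, (pos - p0 + eps) / dx)
        if new_lo > new_hi:
            # The cone became empty: close the segment, start a new one here.
            segments.append((k0, _pick_slope(lo, hi), p0))
            k0, p0 = keys[pos], pos
            lo, hi = 0.0, float("inf")
        else:
            lo, hi = new_lo, new_hi
    segments.append((k0, _pick_slope(lo, hi), p0))
    return segments


class PLAIndex:
    """Single-level piecewise linear index over a sorted list of keys."""

    def __init__(self, keys, eps):
        self.keys = keys
        self.eps = eps
        self.segments = build_segments(keys, eps)
        self.starts = [s[0] for s in self.segments]

    def lookup(self, key):
        """Return the position of key in keys, or None if it is absent."""
        j = bisect_right(self.starts, key) - 1   # segment covering key
        if j < 0:
            return None                          # key precedes every stored key
        k0, slope, p0 = self.segments[j]
        pred = int(p0 + slope * (key - k0))      # approximate position
        # Search only a window of about 2 * eps positions around the
        # prediction; the extra slot on each side absorbs float rounding.
        lo = max(0, pred - self.eps - 1)
        hi = min(len(self.keys), pred + self.eps + 2)
        i = bisect_left(self.keys, key, lo, hi)
        return i if i < hi and self.keys[i] == key else None


keys = [i * i for i in range(200)]   # illustrative data only
index = PLAIndex(keys, eps=4)
print(len(index.segments), "segments for", len(keys), "keys")
print(index.lookup(49), index.lookup(50))
```

In this sketch a larger `eps` yields fewer segments (less space) at the cost of a wider final search window (more query time), which is the time versus space trade-off the segments are learned to control.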