Learned indexes, which use machine learning models to replace traditional index structures, have shown promising results in recent studies. Existing learned indexes rely on heuristic rules to construct their index structures, which are often suboptimal and sensitive to the data distribution. In this paper, we argue that upper-level RMI nodes should focus on data partitioning instead of model fitting, and show that this leads to much better results on real-world datasets. We introduce entropy as a metric to quantify and characterize the models in learned indexes, providing a new theoretical basis for subsequent work. Moreover, we propose a new memory layout design with a fixed node size throughout the tree structure, which allows the type of each node to be chosen flexibly at runtime. Building on these ideas, we propose CARMI, a new efficient and updatable cache-aware RMI framework. To reduce reliance on the expertise of database administrators, CARMI uses a hybrid construction algorithm to automatically build index structures for various datasets and workloads without any manual tuning. Our experimental study shows that CARMI performs better and is more robust than the baselines, achieving an average speedup of 2.37x/1.98x over B+ Tree/ALEX while using only about 0.70x the memory of B+ Tree. On the SOSD platform, CARMI outperforms all baselines on the real-world datasets, with an average speedup of 1.21x over its nearest competitor. We believe that our theoretical analysis and proposed framework can help bring learned indexes closer to practical deployment.
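To make the entropy metric concrete, the sketch below computes the Shannon entropy of how an upper-level node distributes keys across its children: a balanced partition carries more information per lookup than a skewed one. This is an illustrative toy only; the function name and the exact entropy definition used by CARMI are assumptions, not the paper's implementation.

```python
import math
from collections import Counter

def partition_entropy(child_assignments):
    """Shannon entropy (in bits) of a node's key-to-child assignment.

    Higher entropy means a more balanced partition, i.e. each lookup
    at this node resolves more uncertainty about where the key lives.
    (Hypothetical helper; CARMI's precise definition may differ.)
    """
    n = len(child_assignments)
    counts = Counter(child_assignments)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A root model that splits 8 keys evenly over 4 children ...
balanced = [0, 0, 1, 1, 2, 2, 3, 3]
# ... versus one that funnels most keys into a single child.
skewed = [0, 0, 0, 0, 0, 0, 1, 2]

print(partition_entropy(balanced))  # 2.0 bits, the maximum for 4 children
print(partition_entropy(skewed))    # strictly less than 2.0 bits
```

Under this reading, preferring data partitioning over model fitting at upper-level RMI nodes amounts to preferring child assignments with higher entropy.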