Indexing large-scale databases in main memory remains challenging. Learned index structures, in which the core components of classical indexes are replaced with machine learning models, have recently been suggested to significantly improve performance for read-only range queries. However, a recent benchmark study shows that learned indexes achieve only limited performance improvements for real-world data on modern hardware. More specifically, a learned model either cannot capture the micro-level details and fluctuations of the data distribution, which results in poor accuracy, or it fits the data distribution only at the cost of training a large model whose parameters do not fit into cache. As a consequence, querying a learned index on real-world data takes a substantial number of memory lookups, which degrades performance. In this paper, we adopt a different approach to modeling a data distribution, one that complements the model-fitting approach of learned indexes. We propose Shift-Table, an algorithmic layer that captures the micro-level data distribution and resolves the local biases of a learned model at the cost of at most one memory lookup. Our suggested model combines the low latency of lookup tables with learned indexes and enables low-latency processing of range queries. Using Shift-Table, we achieve a speedup of 1.5X to 2X on real-world datasets compared to trained and tuned learned indexes.
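To make the idea of a correction layer concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes integer keys, uses a plain linear interpolation as a stand-in for the learned model, and stores one shift value per predicted slot; the class name `ShiftTableSketch` and its methods are hypothetical names chosen for illustration. Because the stand-in model is monotonic, the shift stored at the predicted slot points directly at the first key that maps to that slot, so the remaining search is confined to a small window.

```python
from bisect import bisect_left


class ShiftTableSketch:
    """Illustrative correction layer over a monotonic position model.

    Assumptions (not from the paper): integer keys, a linear
    interpolation model, and one shift entry per predicted slot.
    """

    def __init__(self, keys):
        self.keys = sorted(keys)
        self.n = len(self.keys)
        if self.n == 0:
            raise ValueError("keys must not be empty")
        self.kmin = self.keys[0]
        # Guard against division by zero when all keys are equal.
        self.span = max(1, self.keys[-1] - self.keys[0])

        # Count how many keys the model maps to each predicted slot.
        counts = [0] * self.n
        for k in self.keys:
            counts[self._predict(k)] += 1

        # shift[p] = (index of first key predicted at slot p) - p.
        # One extra entry lets lower_bound read the window end.
        self.shift = [0] * (self.n + 1)
        running = 0
        for p in range(self.n):
            self.shift[p] = running - p
            running += counts[p]
        self.shift[self.n] = running - self.n

    def _predict(self, key):
        # Stand-in "learned model": integer linear interpolation,
        # clamped to valid slots. It is monotonic in the key.
        p = (key - self.kmin) * (self.n - 1) // self.span
        return min(max(p, 0), self.n - 1)

    def lower_bound(self, key):
        """Index of the first stored key >= key."""
        p = self._predict(key)
        lo = p + self.shift[p]            # corrected window start
        hi = p + 1 + self.shift[p + 1]    # corrected window end
        return bisect_left(self.keys, key, lo, hi)

    def range_query(self, low, high):
        """All stored keys in the half-open interval [low, high)."""
        return self.keys[self.lower_bound(low):self.lower_bound(high)]
```

In this sketch the two shift entries read by `lower_bound` are adjacent in memory, which is the sense in which the correction costs roughly one extra lookup. How small the final search window is depends on how well the model spreads keys over slots: with a poor model many keys collide on one slot and the window grows.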