Recent work has proposed learned index structures, which learn the distribution of the underlying dataset to improve performance. The initial work on learned indexes showed that by learning the cumulative distribution function of the data, index structures such as the B-Tree can improve their performance by an order of magnitude while requiring a smaller memory footprint. In this paper, we present COAX, a learned index for multidimensional data that, instead of learning the distribution of the keys, learns the correlations between attributes of the dataset. Our approach is driven by the observation that in many datasets, the values of two (or more) attributes are correlated. COAX exploits these correlations to reduce the dimensionality of the dataset. More precisely, we learn how to infer one (or more) attributes $C_d$ from the remaining attributes and therefore no longer need to index $C_d$. This reduces the dimensionality and thus makes the index smaller and more efficient. We theoretically analyze the effectiveness of the proposed technique based on the predictability of the functionally dependent (FD) attributes. We further show experimentally that by predicting correlated attributes in the data, we can improve query execution time and reduce the memory overhead of the index. In our experiments, COAX reduces query execution time by 25% while shrinking the memory footprint of the index by four orders of magnitude.
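To make the core idea concrete, the following minimal Python sketch illustrates the principle under simplifying assumptions; it is our illustration, not the COAX implementation. A synthetic attribute c is correlated with attribute a, a simple least-squares line with a recorded maximum error stands in for the learned correlation, and a sorted array stands in for the remaining one-dimensional index. All names, the linear model, and the error-bound strategy are illustrative assumptions.

```python
import bisect
import random

random.seed(0)

# Synthetic data: attribute c is strongly correlated with attribute a
# (c ~ 2*a + small noise), so c is a candidate for being dropped from the index.
rows = [(a, 2.0 * a + random.uniform(-5.0, 5.0))
        for a in sorted(random.uniform(0.0, 1000.0) for _ in range(10_000))]

# "Learn" the correlation with a least-squares line (a stand-in for the learned
# model) and record the maximum residual so that lookups stay exact.
n = len(rows)
mean_a = sum(a for a, _ in rows) / n
mean_c = sum(c for _, c in rows) / n
slope = (sum((a - mean_a) * (c - mean_c) for a, c in rows)
         / sum((a - mean_a) ** 2 for a, _ in rows))
intercept = mean_c - slope * mean_a
max_err = max(abs(c - (slope * a + intercept)) for a, c in rows)

# Only attribute a is indexed; a sorted array stands in for the remaining
# one-dimensional (learned or B-Tree) index.
keys_a = [a for a, _ in rows]

def range_query_on_c(c_lo, c_hi):
    """Answer `c BETWEEN c_lo AND c_hi` without an index on c: invert the model
    into a range on a (assuming a positive slope), widen it by the error bound,
    probe the one-dimensional index, and filter out false positives."""
    a_lo = (c_lo - max_err - intercept) / slope
    a_hi = (c_hi + max_err - intercept) / slope
    lo = bisect.bisect_left(keys_a, a_lo)
    hi = bisect.bisect_right(keys_a, a_hi)
    return [(a, c) for a, c in rows[lo:hi] if c_lo <= c <= c_hi]

print(len(range_query_on_c(100.0, 120.0)))  # rows whose c falls in [100, 120]
```

Because the error bound is taken over the whole dataset, the widened range on a may contain false positives, which the final filter removes; the payoff is that no index on c needs to be built or maintained.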