Consistent and Flexible Selectivity Estimation for High-Dimensional Data

Selectivity estimation aims to estimate the number of database objects that satisfy a selection criterion. Solving this problem accurately and efficiently is essential to many applications, such as density estimation, outlier detection, query optimization, and data integration. The problem is especially challenging for large-scale high-dimensional data because of the curse of dimensionality, the large variance of selectivity across different queries, and the need to make the estimator consistent (i.e., the estimated selectivity is non-decreasing in the threshold). We propose a new deep learning-based model that learns a query-dependent piecewise linear function as the selectivity estimator. The model is flexible enough to fit the selectivity curve of any distance function and query object, while guaranteeing that the output is non-decreasing in the threshold. To improve accuracy on large datasets, we partition the dataset into multiple disjoint subsets and build a local model on each of them. Experiments on real datasets show that the proposed model consistently outperforms state-of-the-art models in accuracy while remaining efficient, and that it is useful in real applications.
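
The consistency guarantee can be illustrated with a minimal sketch. It is not the paper's architecture; it only shows one common way to make a piecewise linear function non-decreasing by construction. The names `make_estimator`, `partitioned_estimate`, `knots`, and `raw_increments` are hypothetical. In a learned model the knots and raw increments would be predicted per query by a network, whereas here they are passed in by hand. Summing the local estimates is also an assumption: it mirrors the fact that the true count over disjoint subsets is the sum of the per-subset counts.

```python
import bisect
import math
from itertools import accumulate


def softplus(x):
    # Numerically stable log(1 + exp(x)); maps any real number to a positive one.
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


def make_estimator(knots, raw_increments):
    """Build a non-decreasing piecewise linear function of the threshold.

    knots:          strictly increasing thresholds tau_0 < ... < tau_K
                    (hypothetical; a learned model would predict these).
    raw_increments: K unconstrained reals, one per segment.
    """
    if len(raw_increments) != len(knots) - 1:
        raise ValueError("need exactly one increment per segment")
    # Value at tau_0 is 0; each later knot adds a strictly positive increment,
    # so the knot values are increasing regardless of the raw inputs.
    values = [0.0] + list(accumulate(softplus(r) for r in raw_increments))

    def estimate(t):
        # Clamp outside the knot range, interpolate linearly inside it.
        if t <= knots[0]:
            return values[0]
        if t >= knots[-1]:
            return values[-1]
        i = bisect.bisect_right(knots, t) - 1
        frac = (t - knots[i]) / (knots[i + 1] - knots[i])
        return values[i] + frac * (values[i + 1] - values[i])

    return estimate


def partitioned_estimate(local_estimators, t):
    # Assumed combination rule: add up the estimates of the local models,
    # one per disjoint subset. A sum of non-decreasing functions is
    # itself non-decreasing, so consistency is preserved.
    return sum(f(t) for f in local_estimators)
```

Because monotonicity comes from the parameterization (positive increments accumulated over ordered knots) rather than from a training penalty, it holds for any raw values, including those of an untrained model.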
