Owing to their outstanding ability to capture underlying data distributions, deep learning techniques have recently been applied to a range of traditional database problems. In this paper, we investigate the use of deep learning for cardinality estimation of similarity selection. Solving this problem accurately and efficiently is essential to many data management applications, especially query optimization. Moreover, in some applications the estimated cardinality is expected to be consistent and interpretable, so an estimate that is monotonic w.r.t. the query threshold is preferred. We propose a novel and generic method that can be applied to any data type and distance function. Our method consists of a feature extraction model and a regression model. The feature extraction model transforms the original data and threshold to a Hamming space, in which a deep learning-based regression model exploits the incremental property of cardinality w.r.t. the threshold to achieve both accuracy and monotonicity. We develop a training strategy tailored to our model as well as techniques for fast estimation. We also discuss how to handle updates. We demonstrate the accuracy and efficiency of our method through experiments, and show how it improves the performance of a query optimizer.
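The incremental property mentioned above can be illustrated with a minimal sketch: if the cardinality at threshold t is written as a sum of non-negative per-distance increments for distances 0 through t, the resulting estimate cannot decrease as t grows. The sketch below is an illustration only, not the method of the paper. It assumes records are already encoded as bit vectors in a Hamming space, and it computes exact per-distance counts where a learned regression model would predict them. The function names (`hamming`, `increments`, `cardinality_curve`) and the toy data are hypothetical.

```python
from itertools import accumulate


def hamming(a: int, b: int) -> int:
    """Hamming distance between two bit vectors packed into ints."""
    return bin(a ^ b).count("1")


def increments(query: int, codes: list, max_dist: int) -> list:
    """Count records at exactly distance d from the query, for d = 0..max_dist.

    Assumption: these exact counts stand in for the non-negative
    per-distance values that a regression model would predict.
    """
    counts = [0] * (max_dist + 1)
    for code in codes:
        d = hamming(query, code)
        if d <= max_dist:
            counts[d] += 1
    return counts


def cardinality_curve(incs: list) -> list:
    """Prefix sums of the increments: entry t is the cardinality at threshold t.

    Because every increment is non-negative, the curve is non-decreasing in t.
    """
    return list(accumulate(incs))


# Toy dataset of 4-bit codes (hypothetical).
codes = [0b0000, 0b0001, 0b0011, 0b0111, 0b1111, 0b1000]
query = 0b0000

curve = cardinality_curve(increments(query, codes, max_dist=4))
print(curve)  # → [1, 3, 4, 5, 6]
```

Monotonicity here follows from the construction alone: any predictor that outputs non-negative increments yields a non-decreasing estimate, regardless of how accurate the individual increments are.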