Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems, such as columnstore compression and data profiling. In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples. Such efficient estimation is critical for tasks where it is prohibitive to scan the data even once. Existing sample-based estimators typically rely on heuristics or assumptions and do not have robust performance across different datasets as the assumptions on data can easily break. On the other hand, deriving an estimator from a principled formulation such as maximum likelihood estimation is very challenging due to the complex structure of the formulation. We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator. To this end, we need to answer several questions: i) how to make the learned model workload agnostic; ii) how to obtain training data; iii) how to perform model training. We derive conditions of the learning framework under which the learned model is workload agnostic, in the sense that the model/estimator can be trained with synthetically generated training data, and then deployed into any data warehouse simply as, e.g., user-defined functions (UDFs), to offer efficient (within microseconds on CPU) and accurate NDV estimations for unseen tables and workloads. We compare the learned estimator with the state-of-the-art sample-based estimators on nine real-world datasets to demonstrate its superior estimation accuracy. We publish our code for training data generation, model training, and the learned estimator online for reproducibility.
翻译:在这项工作中,我们的重点是如何从随机(在线/离线)样本中得出准确的NDV估算。这种高效估算对于即使一次扫描数据也令人望而却步的任务至关重要。 现有基于抽样的估算器通常依赖超常或假设,而且由于数据假设很容易破碎,不同数据集的估算器没有强大的性能。另一方面,由于设计结构复杂,从诸如最大可能性估算等原则性公式中得出一个估算器非常困难。我们提议在监督的学习框架内制定NDV估算任务,目的是学习模型作为测算器。为此,我们需要回答几个问题:一) 如何使所学模型工作量具有可比性;二(二) 如何获得培训数据;三) 如何进行模型培训。我们从一个有原则的公式下得出一个估算器框架,即准确的估算值是9。我们从该模型的准确性估算值中得出一个估算器的估算器,从一个受监督的模型中得出其估算任务,以学习模型作为测算的模型。