Inspired by the performance of score-based diffusion models in estimating complex text, video, and image distributions with thousands of dimensions, we introduce Accelerated Diffusion Cardest (ADC), the first joint distribution cardinality estimator based on a downsized diffusion model. To calculate the pointwise density value of data distributions, ADC's density estimator uses a formula that evaluates log-likelihood by integrating the score function, a gradient mapping which ADC has learned to efficiently approximate using its lightweight score estimator. To answer ranged queries, ADC's selectivity estimator first predicts their selectivity using a Gaussian Mixture Model (GMM), then uses importance sampling Monte Carlo to correct its predictions with more accurate pointwise density values calculated by the density estimator. ADC+ further trains a decision tree to identify the high-volume, high-selectivity queries that the GMM alone can predict very accurately, in which case it skips the correction phase to prevent Monte Carlo from adding more variance. Doing so lowers median Q-error and cuts per-query latency by 25 percent, making ADC+ usually twice as fast as Naru, arguably the state-of-the-art joint distribution cardinality estimator. Numerical experiments using well-established benchmarks show that on all real-world datasets tested, ADC+ is capable of rivaling Naru and outperforming MSCN, DeepDB, LW-Tree, and LW-NN using around 66 percent their storage space, being at least 3 times as accurate as MSCN on 95th and 99th percentile error. Furthermore, on a synthetic dataset where attributes exhibit complex, multilateral correlations, ADC and ADC+ are considerably robust while almost every other learned model suffered significant accuracy declines. In this case, ADC+ performs better than any other tested model, being 10 times as accurate as Naru on 95th and 99th percentile error.
翻译:受基于分数的扩散模型在估计具有数千维度的复杂文本、视频和图像分布方面性能的启发,我们提出了加速扩散基数估计器(ADC),这是首个基于降维扩散模型的联合分布基数估计器。为计算数据分布的点态密度值,ADC的密度估计器采用了一种通过积分分数函数来评估对数似然的公式;该分数函数是一个梯度映射,ADC已学会使用其轻量级分数估计器进行高效逼近。为回答范围查询,ADC的选择性估计器首先使用高斯混合模型(GMM)预测其选择性,随后采用重要性采样蒙特卡洛方法,利用密度估计器计算出的更精确的点态密度值来校正预测结果。ADC+进一步训练决策树以识别那些仅凭GMM即可高精度预测的高容量、高选择性查询,在此情况下跳过校正阶段,避免蒙特卡洛方法引入额外方差。此举将中位数Q误差降低了25%,并使每查询延迟减少25%,使得ADC+通常比当前最先进的联合分布基数估计器Naru快一倍。使用成熟基准进行的数值实验表明,在所有测试的真实世界数据集上,ADC+能够与Naru相媲美,并以约66%的存储空间优于MSCN、DeepDB、LW-Tree和LW-NN,在95%和99%分位数误差上至少比MSCN精确3倍。此外,在属性呈现复杂多边相关性的合成数据集上,ADC与ADC+表现出显著鲁棒性,而几乎所有其他学习模型均出现精度显著下降。在此情况下,ADC+优于所有其他测试模型,在95%和99%分位数误差上比Naru精确10倍。