Cardinality estimation is one of the most fundamental and challenging problems in query optimization. Neither classical nor learning-based methods perform satisfactorily when estimating the cardinality of join queries. They either rely on simplifying assumptions that lead to ineffective cardinality estimates, or build large models to understand the data distributions, which leads to long planning times and a lack of generalizability across queries. In this paper, we propose FactorJoin, a new framework for estimating the cardinality of join queries. FactorJoin combines the idea behind the classical join-histogram method, which handles joins efficiently, with learning-based methods, which accurately capture attribute correlations. Specifically, FactorJoin scans every table in a database and builds single-table conditional distributions during an offline preparation phase. When a join query arrives, FactorJoin translates it into a factor graph model over the learned distributions to estimate its cardinality effectively and efficiently. Unlike existing learning-based methods, FactorJoin neither needs to de-normalize joins upfront nor requires executed query workloads to train the model. Because it relies only on single-table statistics, FactorJoin has a small space overhead and is extremely easy to train and maintain. In our evaluation, FactorJoin produces more effective estimates than the previous state-of-the-art learning-based methods, with 40x lower estimation latency, 100x smaller model size, and 100x faster training at comparable or better accuracy. In addition, FactorJoin can estimate 10,000 sub-plan queries within one second to optimize the query plan, which is very close to the speed of traditional cardinality estimators in commercial DBMSs.
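As a rough illustration of the general idea of computing a join size from single-table statistics alone, the following minimal Python sketch counts join-key values per table, conditioned on each table's own filter, and then sums the products of matching counts. It is not FactorJoin's implementation: the table contents, the column names, and the helper functions `key_counts` and `join_size_from_counts` are all hypothetical, and it covers only a two-table equi-join without any factor graph. With exact per-value counts the result is exact on this toy data; a practical estimator would replace the counts with binned or learned approximations.

```python
from collections import Counter


def key_counts(rows, key, predicate=lambda row: True):
    """Count join-key values among the rows that pass a single-table filter."""
    return Counter(row[key] for row in rows if predicate(row))


def join_size_from_counts(counts_a, counts_b):
    """Sum, over key values present in both tables, the product of their counts."""
    return sum(n * counts_b[v] for v, n in counts_a.items() if v in counts_b)


# Hypothetical toy tables, for illustration only.
orders = [
    {"cust_id": 1, "amount": 50},
    {"cust_id": 1, "amount": 200},
    {"cust_id": 2, "amount": 300},
    {"cust_id": 3, "amount": 20},
]
customers = [
    {"cust_id": 1, "region": "EU"},
    {"cust_id": 2, "region": "US"},
    {"cust_id": 2, "region": "EU"},
]

# Per-table statistics: join-key counts conditioned on each table's own filter.
filtered_orders = key_counts(orders, "cust_id", lambda r: r["amount"] >= 100)
eu_customers = key_counts(customers, "cust_id", lambda r: r["region"] == "EU")

# The two tables are never joined; only their per-key counts are combined.
size = join_size_from_counts(filtered_orders, eu_customers)
print(size)
```

The point of the sketch is that each table is summarized independently, so no de-normalized join result and no executed query workload is needed to compute the statistics.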