Query optimization is a pivotal part of every database management system (DBMS) because it determines the efficiency of query execution. Numerous works have applied Machine Learning (ML) techniques to cost modeling, cardinality estimation, and end-to-end learned optimizers, but few have proven practical because of long training times, lack of interpretability, and high integration cost. A recent study provides a practical method that optimizes queries by recommending per-query hints, but it suffers from two inherent problems. First, it follows the regression framework to predict the absolute latency of each query plan, which is very challenging because the latencies of the candidate plans for a single query may span multiple orders of magnitude. Second, it requires training a separate model for each dataset, which restricts the use of the trained models in practice. In this paper, we propose COOOL, which predicts Cost Orders of query plans to cOOperate with the DBMS by Learning-To-Rank. Instead of estimating absolute costs, COOOL uses ranking-based approaches to compute relative ranking scores for the costs of query plans. We show that COOOL is theoretically able to distinguish query plans with different latencies. We implement COOOL on PostgreSQL, and extensive experiments on the join-order-benchmark and TPC-H datasets demonstrate that COOOL outperforms PostgreSQL and state-of-the-art methods on single-dataset tasks and, with a unified model, on multiple-dataset tasks. Our experiments also shed some light, from the representation learning perspective, on why COOOL outperforms regression approaches, which may guide future research.
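To make the contrast between regression and ranking concrete, the following is a minimal sketch of a pairwise learning-to-rank objective over the candidate plans of one query. It is not COOOL's actual model or loss, which the abstract does not specify. The function names `pairwise_ranking_loss` and `pick_plan`, the logistic (RankNet-style) loss, and the convention that a higher score means a faster plan are all assumptions made for illustration.

```python
import math


def pairwise_ranking_loss(scores, latencies):
    """Average logistic loss over all plan pairs with different latencies.

    Assumed convention: a higher score should mean a faster plan.
    Only the order of the latencies is used, never their magnitude.
    """
    total, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if latencies[i] < latencies[j]:  # plan i is faster than plan j
                margin = scores[i] - scores[j]
                # log(1 + exp(-margin)), written in a numerically stable form
                total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
                pairs += 1
    return total / pairs if pairs else 0.0


def pick_plan(scores):
    """Return the index of the plan with the highest ranking score."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical latencies (seconds) of three plans for one query,
# spanning several orders of magnitude.
latencies = [0.05, 3.0, 900.0]
good_scores = [2.0, 1.0, -1.0]   # ordered consistently with the latencies
bad_scores = [-1.0, 1.0, 2.0]    # ordered in reverse

print(pairwise_ranking_loss(good_scores, latencies))
print(pairwise_ranking_loss(bad_scores, latencies))
print(pick_plan(good_scores))
```

Because the loss depends only on which plan of each pair is faster, rescaling every latency by the same factor leaves it unchanged. A regression target on absolute latency would instead be dominated by the slowest plans.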