Cardinality estimation is a fundamental but long-unsolved problem in query optimization. Recently, multiple papers from different research groups have consistently reported that learned models have the potential to replace existing cardinality estimators. In this paper, we ask a forward-thinking question: are we ready to deploy these learned cardinality models in production? Our study consists of three main parts. First, we focus on the static environment (i.e., no data updates) and compare five new learned methods with eight traditional methods on four real-world datasets under a unified workload setting. The results show that learned models are indeed more accurate than traditional methods, but they often suffer from high training and inference costs. Second, we explore whether these learned models are ready for dynamic environments (i.e., frequent data updates). We find that they cannot keep up with fast data updates and return large errors for different reasons. With less frequent updates, they perform better, but there is no clear winner among them. Third, we take a deeper look into learned models and explore when they may go wrong. Our results show that the performance of learned methods can be greatly affected by changes in correlation, skewness, or domain size. More importantly, their behaviors are much harder to interpret and often unpredictable. Based on these findings, we identify two promising research directions (controlling the cost of learned models and making learned models trustworthy) and suggest a number of research opportunities. We hope that our study can guide researchers and practitioners to work together to eventually push learned cardinality estimators into real database systems.
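To make the accuracy comparison concrete, below is a minimal, self-contained sketch (not taken from the paper) of the cardinality estimation task and the q-error metric commonly used to compare estimators in this line of work. The toy columns, the synthetic data, and the simple independence-assumption baseline are illustrative assumptions, not any method evaluated in the study.

```python
# Minimal sketch: why correlated columns break a simple traditional estimator,
# measured with q-error. All data and names here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two strongly correlated integer columns, e.g. "city" and "zip_prefix".
city = rng.integers(0, 100, size=n)
zip_prefix = city // 10 + rng.integers(0, 2, size=n)  # almost determined by city

# Query: SELECT COUNT(*) FROM t WHERE city = 42 AND zip_prefix = 4
true_card = int(np.sum((city == 42) & (zip_prefix == 4)))

# Traditional baseline: per-column selectivities combined under the
# attribute-value independence assumption.
sel_city = np.mean(city == 42)
sel_zip = np.mean(zip_prefix == 4)
est_card = sel_city * sel_zip * n

# q-error: max(estimate/true, true/estimate); 1.0 means a perfect estimate.
q_error = max(est_card / true_card, true_card / est_card)
print(f"true={true_card}, estimate={est_card:.1f}, q-error={q_error:.2f}")
```

On such correlated columns, the independence assumption underestimates the true count by a noticeable factor; errors of this kind are what motivate learned estimators, whose costs and robustness are the subject of the study.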