In the age of big data, sorting is an indispensable operation for DBMSes and similar systems. Having data sorted can help produce query plans with significantly lower run times. It also can provide other benefits like having non-blocking operators which will produce data steadily (without bursts), or operators with reduced memory footprint. Sorting may be required on any step of query processing, i.e., be it source data or intermediate results. At the same time, the data to be sorted may not fit into main memory. In this case, an external sort operator, which writes intermediate results to disk, should be used. In this paper we consider an external sort operator of the comparison-based sort type. We discuss its implementation and describe related design decisions. Our aim is to study the impact on performance of a data structure used on the merge step. For this, we have experimentally evaluated three data structures implemented inside a DBMS. Results have shown that it is worthwhile to make an effort to implement an efficient data structure for run merging, even on modern commodity computers which are usually disk-bound. Moreover, we demonstrated that using a loser tree is a more efficient approach than both the naive approach and the heap-based one.
翻译:在大数据时代, 分类对于 DBMS 和类似系统来说是不可或缺的操作。 如果对数据进行分类, 可以帮助生成运行时间大大缩短的查询计划。 它也可以提供其他好处, 比如有不阻拦操作器来稳定地( 不连发) 生成数据, 或者有记忆足迹的操作器来减少内存足。 可能需要在查询处理的任何阶段进行分类, 也就是说, 不管是源数据还是中间结果。 同时, 所要分类的数据可能不适应主内存 。 在这种情况下, 应该使用外部类操作器, 将中间结果写入磁盘。 在本文中, 我们考虑的是比较型类型的外部类操作器。 我们讨论其执行并描述相关的设计决定。 我们的目的是研究对合并步骤上使用的数据结构的性能的影响。 为此, 我们实验性地评估了在 DBMS 中执行的三个数据结构。 结果显示, 值得努力实施高效的数据结构来运行合并, 即使是在通常带有磁盘的现代商品计算机上。 此外, 我们证明, 使用失败树是一种比天性的方法更高效的方法 。