Indexes facilitate efficient querying when the selection predicate is on an indexed key. As a result, when loading data, if we anticipate future selective (point or range) queries, we typically maintain an index that is gradually populated as new data is ingested. In that respect, indexing can be perceived as the process of adding structure to an incoming, otherwise unsorted, data collection. Adding structure comes at a cost: instead of simply appending incoming data, every new entry is inserted into the index. If the data ingestion order matches the indexed attribute order, this ingestion cost is entirely redundant and can be avoided (e.g., via bulk loading in a B+-tree). However, state-of-the-art index designs do not benefit when data is ingested in an order that is close to being sorted but not fully sorted. In this paper, we study how indexes can benefit from partial data sortedness, or near-sortedness, and we propose an ensemble of techniques that combine bulk loading, index appends, variable node fill/split factor, and buffering to optimize the ingestion cost of a tree index in the presence of partial data sortedness. We further augment the proposed design with the metadata structures necessary to ensure competitive read performance. We apply the proposed design paradigm to a state-of-the-art B+-tree, and we propose the Ordered Sort-Merge tree (OSM-tree). OSM-tree outperforms the state of the art by up to 8.8x in ingestion performance in the presence of sortedness, while falling back to a B+-tree's ingestion performance when data is scrambled. OSM-tree offers competitive query performance, leading to performance benefits between 28% and 5x for mixed read/write workloads.
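To make the intuition concrete, the following is a minimal Python sketch of how a small buffer can turn near-sorted ingestion into cheap in-order appends. It is an illustration under stated assumptions, not the OSM-tree design: a sorted list stands in for the leaf level of a tree index, and the class name `NearSortedIndex`, the `buffer_size` parameter, and the `appends`/`inserts` counters are all hypothetical names introduced here.

```python
import bisect
import heapq


class NearSortedIndex:
    """Toy index (hypothetical): a sorted list stands in for a tree's leaf level."""

    def __init__(self, buffer_size=4):
        self.keys = []                  # keys already placed in the "index", kept sorted
        self.buffer = []                # min-heap of ingested keys not yet placed
        self.buffer_size = buffer_size  # assumed tuning knob, not taken from the paper
        self.appends = 0                # cheap in-order appends (fast path)
        self.inserts = 0                # out-of-order insertions (slow path)

    def _place(self, key):
        # Fast path: the key extends the sorted run, so it is appended at the tail.
        if not self.keys or key >= self.keys[-1]:
            self.keys.append(key)
            self.appends += 1
        else:
            # Slow path: locate the position and insert, as a regular index insert would.
            bisect.insort(self.keys, key)
            self.inserts += 1

    def ingest(self, key):
        # Buffer the key; once the buffer overflows, place its smallest key.
        heapq.heappush(self.buffer, key)
        if len(self.buffer) > self.buffer_size:
            self._place(heapq.heappop(self.buffer))

    def flush(self):
        # Drain the buffer in ascending key order.
        while self.buffer:
            self._place(heapq.heappop(self.buffer))


# A near-sorted stream: a few adjacent keys arrive swapped.
stream = [1, 2, 4, 3, 5, 7, 6, 8, 10, 9]

buffered = NearSortedIndex(buffer_size=4)
unbuffered = NearSortedIndex(buffer_size=0)
for k in stream:
    buffered.ingest(k)
    unbuffered.ingest(k)
buffered.flush()
unbuffered.flush()

print(buffered.appends, buffered.inserts)      # 10 0
print(unbuffered.appends, unbuffered.inserts)  # 7 3
```

In this toy setting, the buffer absorbs the local disorder, so every key reaches the index through the append path, whereas without buffering three of the ten keys take the insertion path. A fully scrambled stream would gain little from a small buffer, which mirrors the fallback behavior described above.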