Physical data layout is an important performance factor for modern databases. Clustering, i.e., storing similar values in proximity, can improve performance in several ways. We present an automated model for determining beneficial clustering columns, together with a clustering algorithm, for the column-oriented, memory-resident database Hyrise. To select clustering columns automatically, the model analyzes the database's workload and estimates how much a given choice of clustering columns would change the workload's latency. We evaluate the precision of the model's estimates as well as the overall quality of its clustering suggestions. To apply a chosen clustering configuration, we developed an online clustering algorithm that supports an arbitrary number of clustering dimensions. We show that the algorithm is robust against concurrently running data-modifying queries. We obtain a 5% latency reduction for the TPC-H benchmark when clustering the lineitem table and a 4% latency reduction for the TPC-DS benchmark when clustering the store_sales table.