In column-oriented query processing, a materialization strategy determines when lightweight positions (row IDs) are translated into tuples. It is an important part of column-store architecture, since it defines the class of supported query plans and therefore impacts overall system performance. In this paper we continue investigating materialization strategies for a distributed disk-based column-store. We start by demonstrating cases in which existing approaches impose fundamental limitations on the resulting system performance. Then, to address them, we propose a new hybrid materialization model. The main feature of hybrid materialization is the ability to manipulate both positions and values at the same time. This way, the query engine can flexibly combine the advantages of all existing strategies and support a new class of query plans. Moreover, hybrid materialization allows the query engine to customize the materialization policy of individual attributes. We describe our vision of how hybrid materialization can be implemented in a columnar system. As an example, we use PosDB, a distributed, disk-based column-store. We present the necessary data structures and the internals of a hybrid operator, and describe the algebra of such operators. Based on this implementation, we evaluate the performance of late, ultra-late, and hybrid materialization strategies in several scenarios based on TPC-H queries. Our experiments demonstrate that hybrid materialization is almost two times faster than its counterparts, while providing a more flexible query model.