Real-time data analytics systems such as SAP HANA, MemSQL, and IBM Wildfire employ hybrid data layouts, in which data are stored in different formats throughout their lifecycle. Recent data are stored in a row-oriented format to serve OLTP workloads and support high data rates, while older data are transformed to a column-oriented format for OLAP access patterns. We observe that a Log-Structured Merge (LSM) Tree is a natural fit for a lifecycle-aware storage engine due to its high write throughput and level-oriented structure, in which records propagate from one level to the next over time. To build a lifecycle-aware storage engine using an LSM-Tree, we make a crucial modification to allow different data layouts in different levels, ranging from purely row-oriented to purely column-oriented, leading to a Real-Time LSM-Tree. We give a cost model and an algorithm to design a Real-Time LSM-Tree that is suitable for a given workload, followed by an experimental evaluation of LASER, a prototype implementation of our idea built on top of the RocksDB key-value store. In our evaluation, LASER is almost 5x faster than Postgres (a pure row-store) and two orders of magnitude faster than MonetDB (a pure column-store) for real-time data analytics workloads.
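To make the idea of per-level layouts concrete, the following is a minimal sketch, not LASER's actual implementation. It assumes each level is described by a list of column groups: a single group holding all columns is row-oriented, and one group per column is column-oriented. The names `LEVEL_LAYOUTS`, `split_row`, and `compact`, and the four-column schema, are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]
Layout = List[Tuple[str, ...]]

# Hypothetical layouts for a table with columns a, b, c, d.
# Each level lists its column groups (an assumption for this sketch).
LEVEL_LAYOUTS: List[Layout] = [
    [("a", "b", "c", "d")],             # level 0: one group, purely row-oriented
    [("a", "b"), ("c", "d")],           # level 1: hybrid layout
    [("a",), ("b",), ("c",), ("d",)],   # level 2: purely column-oriented
]


def split_row(key: int, row: Row, layout: Layout) -> List[Tuple[int, Row]]:
    """Project one row onto each column group of a level's layout."""
    return [(key, {col: row[col] for col in group}) for group in layout]


def compact(entries: Dict[int, Row], target_layout: Layout) -> List[List[Tuple[int, Row]]]:
    """Re-lay out full rows for the target level.

    Returns one key-sorted run per column group, mimicking how a
    compaction into a deeper level could change the data layout.
    """
    runs: List[List[Tuple[int, Row]]] = [[] for _ in target_layout]
    for key in sorted(entries):
        fragments = split_row(key, entries[key], target_layout)
        for i, fragment in enumerate(fragments):
            runs[i].append(fragment)
    return runs
```

In this sketch, compacting a set of rows into level 1 yields two sorted runs, one per column group, while compacting into level 2 yields four single-column runs. A scan that touches only column `a` would then read one of four runs at the deepest level instead of whole rows.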