In the last decade, document store database systems have gained traction for storing and querying large volumes of semi-structured data. However, the flexibility of the document stores' data models has limited their ability to store data in a column-major layout, making them less performant for analytical workloads than column store relational databases. In this paper, we propose several techniques, tailored to document stores and based on piggy-backing on Log-Structured Merge (LSM) tree events, to store document data in a columnar layout. We first extend the Dremel format, a popular on-disk columnar format for semi-structured data, to comply with document stores' flexible data model. We then introduce two columnar layouts for organizing and storing data in LSM-based storage. We also highlight the potential of using query compilation techniques for document stores, where values' types are known only at runtime. We have implemented our techniques in Apache AsterixDB and evaluated their impact on storage, data ingestion, and query performance. Our experiments show significant performance gains, improving query execution time by orders of magnitude while minimally impacting ingestion performance.