Document database systems store self-describing semi-structured records, such as JSON, "as-is" without requiring users to pre-define a schema. This gives users the flexibility to change the structure of incoming records without worrying about taking the system offline or hindering the performance of currently running queries. However, this flexibility is not free. The large amount of redundancy in self-describing records can introduce unnecessary storage overhead and hurt query performance. Our focus in this paper is to address the storage overhead issue by introducing a tuple compactor framework that infers and extracts the schema from self-describing semi-structured records during data ingestion. As many prominent document stores, such as MongoDB and Couchbase, adopt Log-Structured Merge (LSM) trees in their storage engines, our framework exploits LSM lifecycle events to piggyback the schema inference and extraction operations. We have implemented and empirically evaluated our approach to measure its impact on storage, data ingestion, and query performance in the context of Apache AsterixDB.
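To make the idea of schema inference and extraction concrete, the following is a minimal sketch and not the framework's actual implementation. It assumes flat records (no nesting), and every name in it (`TupleCompactorSketch`, `compact`, `expand`, `flush`) is hypothetical. When an in-memory batch is "flushed", repeated field names are replaced with small integer IDs, and the inferred schema is kept once on the side.

```python
from __future__ import annotations

from typing import Any


class TupleCompactorSketch:
    """Toy schema inference: replaces repeated field names with integer IDs.

    Hypothetical illustration only; handles flat records, not nested ones.
    """

    def __init__(self) -> None:
        self.field_ids: dict[str, int] = {}         # field name -> assigned ID
        self.field_types: dict[str, set[str]] = {}  # field name -> observed type names

    def _field_id(self, name: str) -> int:
        # Assign the next free ID the first time a field name is seen.
        if name not in self.field_ids:
            self.field_ids[name] = len(self.field_ids)
            self.field_types[name] = set()
        return self.field_ids[name]

    def compact(self, record: dict[str, Any]) -> list[tuple[int, Any]]:
        # Record the observed type of each field and drop the field name.
        compacted = []
        for name, value in record.items():
            fid = self._field_id(name)
            self.field_types[name].add(type(value).__name__)
            compacted.append((fid, value))
        return compacted

    def expand(self, compacted: list[tuple[int, Any]]) -> dict[str, Any]:
        # Rebuild the self-describing record from the inferred schema.
        names = {fid: name for name, fid in self.field_ids.items()}
        return {names[fid]: value for fid, value in compacted}


def flush(memtable: list[dict[str, Any]],
          compactor: TupleCompactorSketch) -> list[list[tuple[int, Any]]]:
    # Stand-in for an LSM flush event: compact every record being written out.
    return [compactor.compact(record) for record in memtable]


if __name__ == "__main__":
    compactor = TupleCompactorSketch()
    memtable = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob", "age": 30}]
    on_disk = flush(memtable, compactor)
    print(on_disk)
    print(compactor.field_ids)
    print([compactor.expand(t) for t in on_disk] == memtable)
```

In this sketch a new field (`age`) simply extends the schema when it first appears, so no schema has to be declared up front. The field names are stored once in `field_ids` rather than in every record.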