Document database systems store self-describing records, such as JSON, "as-is" without requiring users to predefine a schema. This gives users the flexibility to change the structure of incoming records without taking the system offline or hindering the performance of currently running queries. However, this flexibility comes at a cost: the large amount of redundancy in the stored records can introduce unnecessary storage overhead and degrade query performance. Our focus in this paper is to address the storage overhead by introducing a tuple compactor framework that infers and extracts the schema from self-describing records during data ingestion. Because many prominent document stores, such as MongoDB and Couchbase, adopt Log-Structured Merge (LSM) trees in their storage engines, our framework piggybacks the schema inference and extraction operations on LSM lifecycle events. We have implemented our approach in Apache AsterixDB and empirically evaluated its impact on storage, data ingestion, and query performance.
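The idea of piggybacking schema inference on an LSM lifecycle event can be illustrated with a minimal sketch. It is a toy model under stated assumptions, not the AsterixDB implementation: the class `SchemaInferringMemtable`, the `flush_threshold` parameter, and the helper `encoded_size` are hypothetical names introduced here, only top-level fields are handled, and the "on-disk" components are plain Python lists. Records are buffered as-is in an in-memory component; only when that component is flushed are field names collected into a schema and replaced by small integer IDs in the stored records.

```python
import json


class SchemaInferringMemtable:
    """Toy in-memory LSM component (hypothetical); schema is inferred at flush time."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.records = []      # self-describing records, buffered as-is
        self.field_ids = {}    # inferred schema: field name -> integer ID
        self.field_types = {}  # field name -> set of observed type names
        self.flushed = []      # stands in for immutable on-disk components

    def insert(self, record):
        # Ingestion path: no schema work happens here.
        self.records.append(record)
        if len(self.records) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Flush event: infer the schema and strip field names in the same pass.
        compacted = []
        for record in self.records:
            row = {}
            for name, value in record.items():
                fid = self.field_ids.setdefault(name, len(self.field_ids))
                self.field_types.setdefault(name, set()).add(type(value).__name__)
                row[fid] = value
            compacted.append(row)
        if compacted:
            self.flushed.append(compacted)
        self.records = []


def encoded_size(obj):
    """Size in characters of a compact JSON encoding (a rough proxy for storage)."""
    return len(json.dumps(obj, separators=(",", ":")))


# Usage: two records with different structure, flushed together.
raw = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob", "age": 30}]
store = SchemaInferringMemtable(flush_threshold=2)
for rec in raw:
    store.insert(rec)
print(store.field_ids)  # {'id': 0, 'name': 1, 'age': 2}
print(encoded_size(raw), encoded_size(store.flushed[0]))
```

In this sketch the second record adds a field without any schema declaration, and the flushed component is smaller than the raw records because field names are stored once in the schema instead of in every record.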