Data lakehouses (LHs) are at the core of current cloud analytics stacks, providing elastic, relational compute on data stored in cloud data lakes across vendors. For relational semantics, they rely on open table formats (OTFs). Unfortunately, OTFs lack several features because of their metadata designs, such as support for multi-table transactions and recovery after an abort in concurrent, multi-query workloads. This, in turn, can lead to non-repeatable reads, stale data, and high costs in production cloud systems. In this work, we introduce LakeVilla, a modular toolbox that adds recovery, complex transactions, and transaction isolation to state-of-the-art OTFs such as Apache Iceberg and Delta Lake. We investigate its transactional guarantees and show that it has minimal impact on performance (2% on YCSB writes, 2.5% on TPC-DS reads) while providing, in a non-invasive way, concurrency control for multiple readers and writers in arbitrarily long transactions on OTFs.