Metadata Caching in Presto: Towards Fast Data Processing

Presto is an open-source distributed SQL query engine for OLAP, aiming for "SQL on everything". Since being open-sourced in 2013, Presto has steadily gained popularity in large-scale data analytics and has been adopted by a wide range of enterprises. While developing and operating Presto, we observed that Presto worker nodes spend a significant amount of CPU on parsing column-oriented data files. This prevents some companies, including Meta, from increasing their analytical data volumes. In this paper, we present a metadata caching layer, built on top of the Alluxio SDK cache and incorporated in each Presto worker node, that caches the intermediate results of file parsing. The metadata cache provides two caching methods: caching the decompressed metadata bytes from raw data files, and caching the deserialized metadata objects. Our evaluation with the TPC-DS benchmark on Presto demonstrates that, when the cache is warm, the first method reduces a query's CPU consumption by 10%-20%, whereas the second method reduces it by 20%-40%.
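The difference between the two caching methods can be illustrated with a small sketch. This is not Presto or Alluxio code: `MetadataReader`, `make_file`, and `run` are hypothetical names, `zlib` and JSON stand in for the real file format's compression and serialization, and the caches are plain dictionaries with no eviction. It only shows which parsing steps each method is expected to skip on a warm cache.

```python
import json
import zlib


class MetadataReader:
    """Toy reader with two optional caches, keyed by file path (hypothetical)."""

    def __init__(self, files, cache_bytes=False, cache_objects=False):
        self.files = files                # path -> compressed metadata blob
        self.cache_bytes = cache_bytes    # method 1: cache decompressed bytes
        self.cache_objects = cache_objects  # method 2: cache deserialized objects
        self.bytes_cache = {}
        self.object_cache = {}
        self.decompress_calls = 0
        self.deserialize_calls = 0

    def _decompressed(self, path):
        # A hit in the bytes cache skips decompression only.
        if self.cache_bytes and path in self.bytes_cache:
            return self.bytes_cache[path]
        self.decompress_calls += 1
        raw = zlib.decompress(self.files[path])
        if self.cache_bytes:
            self.bytes_cache[path] = raw
        return raw

    def read_metadata(self, path):
        # A hit in the object cache skips decompression and deserialization.
        if self.cache_objects and path in self.object_cache:
            return self.object_cache[path]
        raw = self._decompressed(path)
        self.deserialize_calls += 1
        meta = json.loads(raw)
        if self.cache_objects:
            self.object_cache[path] = meta
        return meta


def make_file(meta):
    """Serialize and compress a metadata dict into a stand-in file footer."""
    return zlib.compress(json.dumps(meta).encode("utf-8"))


def run(cache_bytes, cache_objects, reads=3):
    """Read the same file's metadata several times and count the work done."""
    files = {"part-0": make_file({"columns": ["a", "b"], "rows": 100})}
    reader = MetadataReader(files, cache_bytes, cache_objects)
    for _ in range(reads):
        reader.read_metadata("part-0")
    return reader.decompress_calls, reader.deserialize_calls


if __name__ == "__main__":
    print("no cache     :", run(False, False))
    print("bytes cache  :", run(True, False))
    print("object cache :", run(False, True))
```

In this toy model the bytes cache removes repeated decompression but still pays for deserialization on every read, while the object cache removes both, which is consistent with the abstract reporting a larger CPU saving for the second method. The sketch says nothing about memory cost or invalidation, which a real cache would have to handle.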
