Columnar storage is one of the core components of a modern data analytics system. Although many database management systems (DBMSs) have proprietary storage formats, most provide extensive support for open-source storage formats such as Parquet and ORC to facilitate cross-platform data sharing. But these formats were developed over a decade ago, in the early 2010s, for the Hadoop ecosystem. Since then, both the hardware and workload landscapes have changed significantly. In this paper, we revisit the most widely adopted open-source columnar storage formats (Parquet and ORC) with a deep dive into their internals. We designed a benchmark to stress-test the formats' performance and space efficiency under different workload configurations. From our comprehensive evaluation of Parquet and ORC, we identify design decisions that are advantageous on modern hardware and with real-world data distributions. These include using dictionary encoding by default, favoring decoding speed over compression ratio for integer encoding algorithms, making block compression optional, and embedding finer-grained auxiliary data structures. Our analysis identifies important considerations that may guide future formats to better fit modern technology trends.
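To make the first of the listed design decisions concrete, the following is a minimal sketch of dictionary encoding in Python. It is an illustration only and not the encoder used by Parquet or ORC: the function names `dict_encode` and `dict_decode` and the sample column are assumptions introduced here, and real formats additionally bit-pack or run-length encode the integer codes.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def dict_encode(values: Sequence[Hashable]) -> Tuple[List[Hashable], List[int]]:
    """Map each value to an integer code that indexes a list of distinct values.

    Hypothetical helper for illustration; codes are assigned in order of
    first appearance.
    """
    dictionary: List[Hashable] = []   # distinct values, in first-seen order
    index: Dict[Hashable, int] = {}   # value -> code lookup
    codes: List[int] = []             # one code per input value
    for v in values:
        code = index.get(v)
        if code is None:
            code = len(dictionary)
            index[v] = code
            dictionary.append(v)
        codes.append(code)
    return dictionary, codes


def dict_decode(dictionary: Sequence[Hashable], codes: Sequence[int]) -> List[Hashable]:
    """Rebuild the original column by looking each code up in the dictionary."""
    return [dictionary[c] for c in codes]


# Made-up sample column with few distinct values.
column = ["US", "DE", "US", "US", "FR", "DE"]
dictionary, codes = dict_encode(column)
assert dict_decode(dictionary, codes) == column
```

In this sketch, a column with few distinct values is reduced to a short dictionary plus a sequence of small integers, which is the property that makes a follow-up integer encoding step effective.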