Modern big data frameworks (such as Hadoop and Spark) allow multiple users to perform large-scale analyses simultaneously. Typically, users deploy Data-Intensive Workflows (DIWs) for their analytical tasks. The DIWs of different users share a large fraction of common parts (typically 50-80%), which can be materialized and reused in future executions. Materialization reduces the overall processing time of DIWs and saves computational resources. Current materialization solutions store data on Distributed File Systems (DFS) using a single fixed data format. However, a fixed choice is not optimal in every situation. For example, it is well known that different data fragmentation strategies (i.e., horizontal, vertical, or hybrid) perform better or worse depending on the access patterns of the subsequent operations. In this paper, we present a cost-based approach that helps to choose the most appropriate storage format in each situation. We present a generic cost-based storage format selector framework that considers all three fragmentation strategies. We then use our framework to instantiate cost models for specific Hadoop data formats (namely SequenceFile, Avro, and Parquet) and evaluate it on realistic use cases. Our solution gives on average a 33% speedup over SequenceFile, an 11% speedup over Avro, and a 32% speedup over Parquet; overall, it provides up to a 25% performance gain.