In the era of data explosion, a growing number of data-intensive computing frameworks, such as Apache Hadoop and Spark, have been proposed to process massive volumes of unstructured data in parallel. Because the programming models of these frameworks let users combine complex and diverse user-defined functions (UDFs) with predefined operations, tuning overall system performance becomes a major challenge when programmers do not fully understand the semantics of the code, the data, and the runtime system. In this paper, we design a holistic semantics-aware optimization for data-intensive applications using hybrid program analysis (SODA) to help programmers diagnose and tune performance issues. SODA is a two-phase framework: the offline phase is a static analysis that examines the code together with performance profiling data collected by the online phase during prior executions, and generates a parameterized and instrumented application; the online phase is a dynamic analysis that tracks the application's execution and collects runtime information about the data and the system. Extensive experiments on four real-world Spark applications show that SODA achieves speedups of up to 60%, 10%, and 8% over the original implementations with the three proposed optimization strategies, i.e., cache management, operation reordering, and element pruning, respectively.
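To make operation reordering concrete, the following minimal sketch illustrates the general idea in plain Python rather than Spark. It is not SODA's implementation. The record shape, the function names (`run_original`, `run_reordered`), and the call counter are all hypothetical, and the sketch assumes the filter predicate depends only on a field that the UDF leaves unchanged, which is what makes the reordering safe.

```python
from typing import Callable, List, Tuple

# Hypothetical record shape for illustration: (key, value).
Record = Tuple[str, int]


def run_original(records: List[Record],
                 udf: Callable[[int], int],
                 keep: Callable[[str], bool]) -> List[Record]:
    """Apply the UDF to every record, then filter on the key."""
    mapped = [(k, udf(v)) for k, v in records]
    return [(k, v) for k, v in mapped if keep(k)]


def run_reordered(records: List[Record],
                  udf: Callable[[int], int],
                  keep: Callable[[str], bool]) -> List[Record]:
    """Filter on the key first, then apply the UDF only to the survivors.

    Safe only because `keep` reads the key and `udf` changes only the value.
    """
    return [(k, udf(v)) for k, v in records if keep(k)]


# Count UDF invocations as a stand-in for the cost of an expensive UDF.
calls = {"n": 0}


def square(v: int) -> int:
    calls["n"] += 1
    return v * v


def keep_a(k: str) -> bool:
    return k == "a"


records = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]

calls["n"] = 0
original = run_original(records, square, keep_a)
original_calls = calls["n"]

calls["n"] = 0
reordered = run_reordered(records, square, keep_a)
reordered_calls = calls["n"]
```

Both pipelines return the same records, but the reordered one invokes the UDF only on records that pass the filter. Deciding automatically when such a reordering preserves semantics is the kind of question a static analysis of the UDF code would have to answer.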