Spark SQL has been widely deployed in industry, but tuning its performance is challenging. Recent studies employ machine learning (ML) to address this problem but suffer from two drawbacks. First, collecting training samples takes a long time (high overhead). Second, the configuration that is optimal for one input data size of an application might not be optimal for other input data sizes of the same application. To address these issues, we propose LOCAT, a novel Bayesian Optimization (BO) based approach that automatically tunes the configurations of Spark SQL applications online. LOCAT introduces three techniques. The first, Query Configuration Sensitivity Analysis (QCSA), eliminates configuration-insensitive queries when collecting training samples. The second, a Datasize-Aware Gaussian Process (DAGP), models the performance of an application as a distribution over functions of both the configuration parameters and the input data size. The third, Identifying Important Configuration Parameters (IICP), identifies the configuration parameters that matter most for performance and tunes only those. As a result, LOCAT can tune the configurations of a Spark SQL application with low overhead and adapt to different input data sizes. We evaluate LOCAT with Spark SQL applications from the TPC-DS, TPC-H, and HiBench benchmark suites, running on two significantly different clusters: a four-node ARM cluster and an eight-node x86 cluster. The experimental results on the ARM cluster show that LOCAT accelerates the optimization procedures of the state-of-the-art approaches by at least 4.1x and up to 9.7x; moreover, LOCAT improves application performance by at least 1.9x and up to 2.4x. On the x86 cluster, LOCAT shows similar results.
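The idea behind a datasize-aware Gaussian process can be illustrated with a minimal sketch, which is not LOCAT's implementation. It assumes a single normalized configuration parameter, a normalized input data size, an RBF kernel, and a synthetic runtime function; the names `rbf_kernel`, `gp_posterior_mean`, and `suggest_config` are hypothetical. The data size is simply appended to the configuration vector as an extra input, so samples collected at one size inform predictions at another. For brevity the sketch picks the candidate with the lowest predicted runtime, whereas a full BO loop would use an acquisition function.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.5):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq_dist = (np.sum(a**2, axis=1)[:, None]
               + np.sum(b**2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale**2)

def gp_posterior_mean(x_train, y_train, x_query, noise=1e-4):
    """Posterior mean of a zero-mean GP at x_query."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_qt = rbf_kernel(x_query, x_train)
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    return k_qt @ alpha

def suggest_config(configs, sizes, runtimes, candidates, target_size):
    """Return the candidate with the lowest predicted runtime at target_size."""
    # Data size is treated as one more input dimension next to the configuration.
    x_train = np.column_stack([configs, sizes])
    x_query = np.column_stack([candidates, np.full(len(candidates), target_size)])
    # Standardize runtimes so the zero-mean GP prior is reasonable.
    y_mean = runtimes.mean()
    y_std = runtimes.std() or 1.0
    mean = gp_posterior_mean(x_train, (runtimes - y_mean) / y_std, x_query)
    best = int(np.argmin(mean))
    return candidates[best], mean[best] * y_std + y_mean

# Synthetic samples (assumption): the best configuration value shifts with data size.
grid_c, grid_s = np.meshgrid(np.linspace(0, 1, 5), np.array([0.2, 0.8]))
configs = grid_c.reshape(-1, 1)
sizes = grid_s.ravel()
runtimes = (configs[:, 0] - sizes) ** 2 + 1.0

candidates = np.linspace(0, 1, 21).reshape(-1, 1)
best_small, _ = suggest_config(configs, sizes, runtimes, candidates, target_size=0.2)
best_large, _ = suggest_config(configs, sizes, runtimes, candidates, target_size=0.8)
print("suggested config for small input:", best_small)
print("suggested config for large input:", best_large)
```

Because the model conditions on data size, the same set of samples yields different suggestions for the small and the large input, which is the behavior a size-agnostic model cannot provide.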