Distributed data analytic engines like Spark are common choices for processing massive data in industry. However, the performance of Spark SQL depends heavily on the choice of configuration, and the optimal configuration varies with the executed workload. Among the various alternatives for Spark SQL tuning, Bayesian optimization (BO) is a popular framework that finds near-optimal configurations given a sufficient budget, but it suffers from the re-optimization issue and is therefore impractical in real production. When applying transfer learning to accelerate the tuning process, we notice two domain-specific challenges: 1) most previous work focuses on transferring tuning history, while expert knowledge from Spark engineers has great potential to improve tuning performance but has not been well studied so far; 2) historical tasks should be used carefully, since transferring from dissimilar ones leads to deteriorated performance in production. In this paper, we present Rover, a deployed online Spark SQL tuning service for efficient and safe search on industrial workloads. To address these challenges, we propose generalized transfer learning, which boosts tuning performance with external knowledge through expert-assisted Bayesian optimization and controlled history transfer. Experiments on public benchmarks and real-world tasks show the superiority of Rover over competitive baselines. Notably, Rover saves an average of 50.1% of the memory cost on 12k real-world Spark SQL tasks within 20 iterations, and 76.2% of these tasks achieve a significant memory reduction of over 60%.
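To make the second challenge concrete, the following is a minimal sketch of similarity-gated history transfer: past tuning runs are scored against the current task, dissimilar ones are discarded, and only the best configurations of sufficiently similar tasks are used to warm-start the search. This is not Rover's algorithm. The ranking-agreement similarity, the fixed `threshold`, and the names `HistoryTask`, `ranking_similarity`, and `select_warm_start` are all hypothetical choices made for illustration; configurations are modeled as plain tuples and the objective as a memory cost to minimize.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class HistoryTask:
    """Hypothetical record of a past tuning run."""
    name: str
    # Maps a configuration (e.g. (executor_memory_gb, executor_cores)) to its observed memory cost.
    observations: dict


def ranking_similarity(a: dict, b: dict) -> float:
    """Fraction of concordant pairs among configurations evaluated in both tasks.

    Assumption for illustration: two tasks are "similar" if they rank shared
    configurations in the same order. Ties count as non-concordant.
    """
    shared = sorted(set(a) & set(b))
    pairs = list(combinations(shared, 2))
    if not pairs:
        return 0.0
    concordant = sum(1 for x, y in pairs if (a[x] - a[y]) * (b[x] - b[y]) > 0)
    return concordant / len(pairs)


def select_warm_start(target_obs: dict, history: list, threshold: float = 0.7, k: int = 3) -> list:
    """Return best configurations from the k most similar tasks above the threshold.

    Dissimilar tasks are dropped entirely rather than down-weighted, which is
    one simple (assumed) way to keep transfer "controlled".
    """
    scored = [(ranking_similarity(target_obs, t.observations), t) for t in history]
    similar = [pair for pair in scored if pair[0] >= threshold]
    similar.sort(key=lambda pair: pair[0], reverse=True)

    warm_starts = []
    for _, task in similar[:k]:
        best = min(task.observations, key=task.observations.get)  # lowest memory cost
        if best not in warm_starts:
            warm_starts.append(best)
    return warm_starts
```

In a full tuner, the returned configurations would be evaluated first, before the BO surrogate takes over; a task whose ranking contradicts the target's contributes nothing, so it cannot steer the search in a harmful direction.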