Easy-to-use programming interfaces paired with cloud-scale processing engines have enabled big data system users to author arbitrarily complex analytical jobs over massive volumes of data. However, as the complexity and scale of analytical jobs increase, they encounter a number of unforeseen problems. Hotspots with large intermediate data on temporary storage, longer job recovery times after failures, and worse query optimizer estimates are examples of the issues we face at Microsoft. To address these issues, we propose Phoebe, an efficient learning-based checkpoint optimizer. Given a set of constraints and an objective function at compile time, Phoebe determines the decomposition of job plans and the optimal set of checkpoints whose outputs are preserved to durable global storage. Phoebe consists of three machine learning predictors and one optimization module. For each stage of a job, Phoebe makes accurate predictions for: (1) the execution time, (2) the output size, and (3) the start/end time, taking into account the inter-stage dependencies. Using these predictions, we formulate checkpoint optimization as an integer programming problem and propose a scalable heuristic algorithm that meets the latency requirement of the production environment. We demonstrate the effectiveness of Phoebe on production workloads, and show that we can free more than 70% of the temporary storage on hotspots and restart failed jobs 68% faster on average, with minimal performance impact. Phoebe also illustrates that adding multiple sets of checkpoints is not cost-efficient, which dramatically reduces the complexity of the optimization.
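To make the checkpoint selection problem concrete, the following is a minimal sketch of choosing a single set of checkpoints from per-stage predictions. It is not Phoebe's actual formulation or heuristic: the `Stage` fields, the function `best_single_cut`, the storage budget `max_checkpoint_gb`, and the objective (temporary-storage GB-seconds freed by clearing intermediate data at the cut instead of at job end) are all assumptions made for illustration. The sketch enumerates candidate cuts at predicted stage end times and keeps the feasible cut with the largest saving.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    name: str
    end_time: float      # predicted end time, seconds from job start (assumed input)
    output_size: float   # predicted output size in GB (assumed input)
    consumers: list = field(default_factory=list)  # names of downstream stages


def best_single_cut(stages, job_end, max_checkpoint_gb):
    """Return (cut_time, checkpointed_stage_names, gb_seconds_saved).

    Hypothetical objective: every stage that finishes by the cut can have its
    temporary output cleared at the cut instead of at job end, provided the
    outputs that cross the cut are first written to durable storage.
    """
    best = (None, [], 0.0)
    for cut in sorted({s.end_time for s in stages}):
        before = {s.name for s in stages if s.end_time <= cut}
        # Outputs consumed by a stage after the cut must be checkpointed.
        frontier = [
            s for s in stages
            if s.name in before and any(c not in before for c in s.consumers)
        ]
        checkpoint_gb = sum(s.output_size for s in frontier)
        if checkpoint_gb > max_checkpoint_gb:
            continue  # violates the assumed durable-storage budget
        freed_gb = sum(s.output_size for s in stages if s.name in before)
        saved = freed_gb * (job_end - cut)
        if saved > best[2]:
            best = (cut, sorted(s.name for s in frontier), saved)
    return best


# Toy job: A and B feed C, which feeds D.
example_stages = [
    Stage("A", end_time=10.0, output_size=100.0, consumers=["C"]),
    Stage("B", end_time=20.0, output_size=50.0, consumers=["C"]),
    Stage("C", end_time=40.0, output_size=10.0, consumers=["D"]),
    Stage("D", end_time=100.0, output_size=5.0),
]
cut_time, checkpointed, saved = best_single_cut(
    example_stages, job_end=100.0, max_checkpoint_gb=60.0
)
```

With a tight budget the sketch prefers a later cut with a small frontier; relaxing the budget lets it cut earlier and free more temporary storage. Restricting the search to one cut mirrors the abstract's observation that multiple sets of checkpoints are not cost-efficient, although a real optimizer would also have to account for prediction error and checkpoint write cost.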