Distributed dataflow systems like Spark and Flink enable the use of clusters for scalable data analytics. While runtime prediction models can be used to initially select appropriate cluster resources given target runtimes, the actual runtime performance of dataflow jobs depends on several factors and varies over time. Yet, in many situations, dynamic scaling can be used to meet formulated runtime targets despite significant performance variance. This paper presents Enel, a novel dynamic scaling approach that uses message propagation on an attributed graph to model dataflow jobs and thus allows for deriving effective rescaling decisions. To this end, Enel incorporates descriptive properties that capture the respective execution context, considers statistics from individual dataflow tasks, and propagates predictions through the job graph to eventually find an optimized new scale-out. Our evaluation of Enel with four iterative Spark jobs shows that our approach is able to identify effective rescaling actions, reacting, for instance, to node failures, and can be reused across different execution contexts.
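To make the idea of propagating predictions through a job graph concrete, the following is a minimal sketch and not Enel's actual model. Enel learns its predictions from task statistics and context properties on an attributed graph; here, as a stated assumption, each task instead uses a hand-written cost model (a fixed serial part plus parallel work divided by the scale-out). The names `Task`, `predict_task_runtime`, `predict_job_runtime`, `choose_scale_out`, and the example `JOB` are all hypothetical and serve illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """One node of the job graph (hypothetical attributes, in seconds)."""
    name: str
    serial_s: float      # assumed part that does not speed up with more nodes
    parallel_s: float    # assumed work that is divided across nodes
    parents: list = field(default_factory=list)  # names of upstream tasks


def predict_task_runtime(task, scale_out):
    # Stand-in for a learned per-task predictor.
    return task.serial_s + task.parallel_s / scale_out


def predict_job_runtime(tasks, scale_out):
    """Propagate predicted finish times along the edges of the job graph.

    `tasks` must be given in topological order (parents before children).
    """
    finish = {}
    for task in tasks:
        # A task starts once its slowest parent has finished.
        start = max((finish[p] for p in task.parents), default=0.0)
        finish[task.name] = start + predict_task_runtime(task, scale_out)
    return max(finish.values())


def choose_scale_out(tasks, target_s, candidates):
    """Return the smallest candidate scale-out predicted to meet the target.

    Falls back to the largest candidate if none is predicted to suffice.
    """
    for n in sorted(candidates):
        if predict_job_runtime(tasks, n) <= target_s:
            return n
    return max(candidates)


# Made-up example job: a read stage feeding two branches that are joined.
JOB = [
    Task("read", serial_s=5, parallel_s=60),
    Task("map", serial_s=2, parallel_s=120, parents=["read"]),
    Task("stats", serial_s=1, parallel_s=40, parents=["read"]),
    Task("join", serial_s=4, parallel_s=80, parents=["map", "stats"]),
]

if __name__ == "__main__":
    print(choose_scale_out(JOB, target_s=60, candidates=[2, 4, 8, 16]))
```

In a dynamic scaling setting, such a selection would be re-run during execution with updated task statistics (for example, after a node failure slows down some tasks), so that the remaining part of the graph is re-evaluated against the runtime target.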