Many scientific workflow scheduling algorithms need to be informed about task runtimes a priori to conduct efficient scheduling. In heterogeneous cluster infrastructures, this problem is aggravated because these runtimes are required for each task-node pair. Using historical data is often not feasible, as logs are typically not retained indefinitely and both workloads and the infrastructure change over time. In contrast, online methods, which predict task runtimes on specific nodes while the workflow is running, have to cope with the lack of example runs, especially during start-up. In this paper, we present Lotaru, a novel online method for locally estimating task runtimes in scientific workflows on heterogeneous clusters. Lotaru first profiles all nodes of a cluster with a set of short-running and uniform microbenchmarks. Next, it runs the workflow to be scheduled on the user's local machine with drastically reduced data to determine important task characteristics. Based on these measurements, Lotaru learns a Bayesian linear regression model to predict a task's runtime given the input size and finally adjusts the predicted runtime specifically for each task-node pair in the cluster based on the microbenchmark results. Due to its Bayesian approach, Lotaru can also compute robust uncertainty estimates and provides them as input for advanced scheduling methods. Our evaluation with five real-world scientific workflows and different datasets shows that Lotaru significantly outperforms the baselines in terms of prediction errors for homogeneous and heterogeneous clusters.
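To make the prediction step described above concrete, the following is a minimal sketch of how a Bayesian linear regression fitted on local, down-sampled runs could be combined with a microbenchmark-derived node factor to produce a per-node runtime estimate with an uncertainty. The variable names, example numbers, and the simple benchmark-ratio scaling are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch: predict a task's runtime on a cluster node from
# local profiling runs (input size -> runtime) plus a node speed factor
# derived from microbenchmarks. Values below are made up for illustration.
import numpy as np
from sklearn.linear_model import BayesianRidge

# Local profiling runs of one task on drastically reduced input data:
# input sizes (bytes) and observed runtimes (seconds).
X_local = np.array([[1e6], [2e6], [4e6], [8e6]])
y_local = np.array([3.1, 6.0, 12.4, 24.9])

model = BayesianRidge()
model.fit(X_local, y_local)

# Predict mean runtime and its standard deviation for the full-size input
# on the local machine.
full_input_size = np.array([[6.4e7]])
mean_local, std_local = model.predict(full_input_size, return_std=True)

# Microbenchmark scores for the local machine and a target cluster node
# (assumed convention: higher score means a faster node).
bench_local, bench_node = 950.0, 1900.0
node_factor = bench_local / bench_node  # target node is ~2x faster here

# Rescale the local estimate to the target node; the uncertainty is
# scaled with it, giving an input for uncertainty-aware schedulers.
mean_node = mean_local * node_factor
std_node = std_local * node_factor

print(f"predicted runtime on node: {mean_node[0]:.1f}s +/- {std_node[0]:.1f}s")
```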