Task graphs provide a simple way to describe scientific workflows (sets of tasks with dependencies) that can be executed both on HPC clusters and in the cloud. An important aspect of executing such graphs is the scheduling algorithm used. Many scheduling heuristics have been proposed in prior work; however, they are often evaluated in oversimplified environments. We provide an extensible simulation environment designed for prototyping and benchmarking task schedulers. It contains implementations of various scheduling algorithms and is open source, so that our results are fully reproducible. We use this environment to perform a comprehensive analysis of workflow scheduling algorithms, focusing on quantifying the effect of scheduling challenges that have so far been mostly neglected, such as delays between scheduler invocations or partially unknown task durations. Our results indicate that the network models used in many previous works can produce results that differ by an order of magnitude from those of a more realistic model. Additionally, we show that certain often-overlooked implementation details of scheduling algorithms can have a large effect on a scheduler's performance, and they should therefore be described in detail to enable proper evaluation.