With growing constraints on power budgets and increasing hardware failure rates, the operation of future exascale systems faces several challenges. To address these challenges, the HPC community has actively researched resource awareness and adaptivity through malleable jobs. Malleable jobs can change their computing resources at runtime and can significantly improve HPC system performance. However, due to the rigid nature of popular parallel programming paradigms such as MPI and the lack of support for dynamic resource management in batch systems, malleable jobs have remained largely unrealized. In this paper, we extend the SLURM batch system to support the execution and batch scheduling of malleable jobs. The malleable applications are written using a new adaptive parallel paradigm called Invasive MPI, which extends the MPI standard to support resource adaptivity at runtime. We propose two malleable job scheduling strategies to support performance-aware and power-aware dynamic reconfiguration decisions at runtime. We implement the strategies in SLURM and evaluate them on a production HPC system. Results for our performance-aware scheduling strategy show improvements in makespan, average system utilization, and average response and waiting times compared to other scheduling strategies. Moreover, we demonstrate dynamic power corridor management using our power-aware strategy.