For decades, system administrators have been designing and tuning cluster scheduling policies to improve the performance of high-performance computing (HPC) systems. However, increasingly complex HPC systems combined with highly diverse workloads make this manual process challenging, time-consuming, and error-prone. We present DRAS-CQSim, a reinforcement-learning-based HPC scheduling framework that automatically learns an optimal scheduling policy. DRAS-CQSim encapsulates simulation environments, agents, hyperparameter tuning options, and different reinforcement learning algorithms, allowing system administrators to quickly obtain customized scheduling policies.
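To make the separation between simulation environment and agent concrete, the following is a minimal sketch of such a loop. It is not the DRAS-CQSim or CQSim API: the names `Job`, `ToyClusterEnv`, `FCFSAgent`, `RandomAgent`, and `run_episode`, the state encoding, and the reward (negative queue length per tick) are all assumptions chosen for illustration. A learning agent would replace the fixed policies shown here.

```python
import random
from dataclasses import dataclass


@dataclass
class Job:
    nodes: int    # number of nodes requested
    runtime: int  # number of clock ticks the job needs


class ToyClusterEnv:
    """Hypothetical toy scheduling simulator (illustrative only)."""

    def __init__(self, total_nodes, jobs):
        self.total_nodes = total_nodes
        self.initial_jobs = list(jobs)

    def reset(self):
        self.queue = list(self.initial_jobs)  # jobs waiting to start
        self.running = []                     # [remaining_ticks, nodes] pairs
        self.free = self.total_nodes
        self.clock = 0
        return (self.free, len(self.queue))

    def step(self, action):
        """Try to start queue[action], then advance the clock by one tick."""
        if action < len(self.queue) and self.queue[action].nodes <= self.free:
            job = self.queue.pop(action)
            self.free -= job.nodes
            self.running.append([job.runtime, job.nodes])
        self.clock += 1
        still_running = []
        for remaining, nodes in self.running:
            if remaining - 1 > 0:
                still_running.append([remaining - 1, nodes])
            else:
                self.free += nodes            # job finished, release its nodes
        self.running = still_running
        reward = -len(self.queue)             # assumed reward: penalize waiting jobs
        done = not self.queue and not self.running
        return (self.free, len(self.queue)), reward, done


class FCFSAgent:
    """Always tries to start the job at the head of the queue."""

    def act(self, state, n_actions):
        return 0


class RandomAgent:
    """Picks a random position inside the scheduling window."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def act(self, state, n_actions):
        return self.rng.randrange(n_actions)


def run_episode(env, agent, window=3, max_steps=1000):
    """Run one episode; return (total reward, final clock)."""
    state = env.reset()
    total = 0
    for _ in range(max_steps):
        action = agent.act(state, window)
        state, reward, done = env.step(action)
        total += reward
        if done:
            break
    return total, env.clock


if __name__ == "__main__":
    jobs = [Job(2, 2), Job(2, 1), Job(4, 1)]
    print(run_episode(ToyClusterEnv(4, jobs), FCFSAgent()))  # (-3, 3)
```

Because the agent only sees `act(state, n_actions)`, swapping in a different policy or learning algorithm does not require changes to the environment, which is the kind of modularity the abstract describes.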