The cluster scheduler is crucial in high-performance computing (HPC). It determines when and which user jobs should be allocated to available system resources. Existing cluster scheduling heuristics are developed by human experts based on their experience with specific HPC systems and workloads. However, the increasing complexity of computing systems and the highly dynamic nature of application workloads have placed a tremendous burden on manually designed and tuned scheduling heuristics. More aggressive optimization and automation are needed for cluster scheduling in HPC. In this work, we present an automated HPC scheduling agent named DRAS (Deep Reinforcement Agent for Scheduling) that leverages deep reinforcement learning. DRAS is built on a novel hierarchical neural network incorporating HPC-specific scheduling features such as resource reservation and backfilling. A unique training strategy is presented to enable DRAS to rapidly learn the target environment. Once provided with a specific scheduling objective by the system manager, DRAS automatically learns to improve its policy through interaction with the scheduling environment and dynamically adjusts its policy as the workload changes. Experiments with different production workloads demonstrate that DRAS outperforms existing heuristic and optimization approaches by up to 45%.
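For readers unfamiliar with resource reservation and backfilling, the following is a minimal sketch of one scheduling pass in the style of the classic first-come-first-served heuristic with EASY backfilling. It illustrates the two features only and is not the DRAS network or its policy; the names `Job`, `schedule_step`, `free_nodes`, and `running` are hypothetical, and it assumes user-supplied walltime estimates are treated as exact.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int     # number of nodes requested
    walltime: int  # user-estimated runtime, assumed exact in this sketch


def schedule_step(queue, free_nodes, running, now=0):
    """Run one scheduling pass (illustrative heuristic, not DRAS).

    queue:   jobs in arrival order
    running: list of (end_time, nodes) for jobs already executing
    Returns (names of jobs started now, reservation), where reservation
    is (job name, reserved start time) or None.
    """
    queue = list(queue)
    running = list(running)
    started = []

    # 1. Start jobs in arrival order for as long as the head job fits.
    while queue and queue[0].nodes <= free_nodes:
        job = queue.pop(0)
        free_nodes -= job.nodes
        running.append((now + job.walltime, job.nodes))
        started.append(job.name)
    if not queue:
        return started, None

    # 2. Reservation: find the earliest time the blocked head job can start.
    head = queue[0]
    avail = free_nodes
    shadow_time = None
    for end_time, nodes in sorted(running):
        avail += nodes
        if avail >= head.nodes:
            shadow_time = end_time
            break
    if shadow_time is None:
        return started, None  # head job is larger than the whole machine
    extra = avail - head.nodes  # nodes left over once the head job starts

    # 3. Backfilling: start later jobs only if they cannot delay the head job.
    for job in queue[1:]:
        if job.nodes > free_nodes:
            continue
        ends_in_time = now + job.walltime <= shadow_time
        if ends_in_time or job.nodes <= extra:
            free_nodes -= job.nodes
            if not ends_in_time:
                extra -= job.nodes  # job still holds nodes at shadow_time
            started.append(job.name)
    return started, (head.name, shadow_time)
```

In this sketch the head job always keeps its reserved start time, and smaller jobs are started out of order only when they either finish before that time or fit in the nodes the head job will not need. A learned agent such as DRAS would replace the fixed selection rules in steps 1 and 3 with a policy trained against the objective set by the system manager.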