High-performance computing (HPC) is undergoing significant changes. Emerging HPC workloads comprise both compute-intensive and data-intensive applications. To meet the intense I/O demand of data-intensive applications, burst buffers are being deployed in production systems. Existing HPC schedulers, however, are mainly CPU-centric. The extreme heterogeneity of hardware devices, combined with these workload changes, forces schedulers to consider multiple resources beyond CPUs (e.g., burst buffers) in their decision making. In this study, we present a multi-resource scheduling scheme named BBSched that schedules user jobs based not only on their CPU requirements but also on other schedulable resources such as burst buffers. BBSched formulates the scheduling problem as a multi-objective optimization (MOO) problem and rapidly solves it using a multi-objective genetic algorithm. The multiple solutions generated by BBSched enable system managers to explore potential tradeoffs among the various resources and therefore obtain better utilization of all of them. Trace-driven simulations with real system workloads demonstrate that BBSched improves scheduling performance by up to 41% compared to existing methods, indicating that explicitly optimizing multiple resources beyond CPUs is essential for HPC scheduling.
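To make the MOO formulation concrete, the following is a minimal sketch, not BBSched's actual implementation. It assumes a hypothetical window of six queued jobs, each requesting some compute nodes and some burst buffer capacity, and it uses a simplified elitist genetic algorithm rather than any specific published variant. The job sizes, the capacities `TOTAL_NODES` and `TOTAL_BB`, and all function names are illustrative assumptions. Each candidate solution is a bit mask selecting which jobs to start; the two objectives are node utilization and burst buffer utilization, and the result is a set of mutually non-dominated selections.

```python
import random

# Hypothetical queue window: (nodes requested, burst buffer TB requested).
JOBS = [(100, 20), (60, 0), (40, 50), (80, 10), (20, 40), (50, 30)]
TOTAL_NODES, TOTAL_BB = 200, 100  # assumed free capacity


def objectives(mask):
    """Return (node utilization, burst buffer utilization), or None if infeasible."""
    nodes = sum(job[0] for job, on in zip(JOBS, mask) if on)
    bb = sum(job[1] for job, on in zip(JOBS, mask) if on)
    if nodes > TOTAL_NODES or bb > TOTAL_BB:
        return None
    return (nodes / TOTAL_NODES, bb / TOTAL_BB)


def dominates(a, b):
    """True if a is at least as good as b on every objective and differs from b."""
    return all(x >= y for x, y in zip(a, b)) and a != b


def pareto_front(scored):
    """Keep only the selections whose objective vector no other vector dominates."""
    return {m: f for m, f in scored.items()
            if not any(dominates(g, f) for g in scored.values())}


def repair(mask, rng):
    """Drop randomly chosen jobs until the selection fits both capacities."""
    mask = list(mask)
    while objectives(mask) is None:
        selected = [i for i, on in enumerate(mask) if on]
        mask[rng.choice(selected)] = 0
    return tuple(mask)


def evolve(generations=50, pop_size=20, seed=0):
    rng = random.Random(seed)
    n = len(JOBS)
    pop = {repair([rng.randint(0, 1) for _ in range(n)], rng)
           for _ in range(pop_size)}
    for _ in range(generations):
        parents = list(pop)
        children = set()
        for _ in range(pop_size):
            a, b = rng.choice(parents), rng.choice(parents)
            cut = rng.randrange(1, n)          # one-point crossover
            child = list(a[:cut] + b[cut:])
            child[rng.randrange(n)] ^= 1       # single bit-flip mutation
            children.add(repair(child, rng))
        scored = {m: objectives(m) for m in pop | children}
        front = pareto_front(scored)
        # Elitism: always keep the current front, then pad with other candidates.
        rest = [m for m in scored if m not in front]
        rng.shuffle(rest)
        pop = set(front) | set(rest[:max(0, pop_size - len(front))])
    return pareto_front({m: objectives(m) for m in pop})


if __name__ == "__main__":
    for mask, (cpu, bb) in sorted(evolve().items(), key=lambda kv: kv[1]):
        print(mask, f"nodes={cpu:.0%}", f"burst buffer={bb:.0%}")
```

Returning the whole front instead of a single best selection mirrors the point made in the abstract: a system manager can inspect the alternatives and choose, for example, a selection that gives up some node utilization in exchange for much higher burst buffer utilization.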