Efficient GPU resource scheduling is essential to maximize resource utilization and reduce training costs for the growing number of deep learning workloads in shared GPU clusters. Existing GPU schedulers largely rely on static policies to exploit the performance characteristics of deep learning jobs. However, they struggle to achieve optimal efficiency due to their lack of elasticity. To address this problem, we propose ONES, an ONline Evolutionary Scheduler for elastic batch size orchestration. ONES automatically manages the elasticity of each job based on its training batch size, so as to maximize GPU utilization and improve scheduling efficiency. It determines the batch size for each job through an online evolutionary search that continuously optimizes the scheduling decisions. We evaluate the effectiveness of ONES with 64 GPUs on TACC's Longhorn supercomputer. The results show that ONES outperforms prior deep learning schedulers with a significantly shorter average job completion time.
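To give a flavor of the batch-size search the abstract describes, the following is a minimal, illustrative sketch of an online evolutionary search over candidate batch sizes. The `fitness` function here is a hypothetical throughput proxy and the doubling/halving mutation is an assumption for illustration; neither is taken from ONES itself.

```python
import random

def fitness(batch_size, gpu_capacity=512):
    """Toy fitness: reward GPU utilization, zero out infeasible sizes.

    Hypothetical proxy, not the paper's actual objective."""
    if batch_size > gpu_capacity:
        return 0.0
    return batch_size / gpu_capacity

def mutate(batch_size):
    """Randomly double or halve the batch size, keeping it at least 1."""
    return max(1, batch_size * 2 if random.random() < 0.5 else batch_size // 2)

def evolve(initial=32, generations=50, population_size=8, seed=0):
    """Evolve a population of batch sizes, keeping the fittest each round."""
    random.seed(seed)
    population = [initial] * population_size
    for _ in range(generations):
        # Generate mutated offspring and select the top candidates.
        candidates = population + [mutate(b) for b in population]
        candidates.sort(key=fitness, reverse=True)
        population = candidates[:population_size]
    return population[0]

best = evolve()  # best batch size found under the toy fitness
```

In an online setting, the fitness of each candidate would instead be measured from live training throughput, and the search would run continuously as jobs arrive and depart, which is the elasticity the abstract refers to.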