We propose a novel GPU-cluster scheduler for distributed DL (DDL) workloads that enables proximity-based consolidation of GPU resources, driven by the DDL jobs' sensitivities to anticipated communication-network delays. Our scheduler consists of three major components: (i) a classical delay scheduling algorithm to facilitate job placement and consolidation; (ii) a network-sensitive job preemption strategy; and (iii) an "auto-tuner" mechanism to optimize delay timers for effective delay scheduling. Additionally, to enable a cost-effective methodology for large-scale experiments, we develop a data-driven DDL cluster simulation platform. Using the simulation platform, we compare our scheduler against several state-of-the-art alternatives on real-world workload traces to demonstrate the benefits of our design. Compared to prevailing consolidation-based scheduling methods, our scheduler improves the end-to-end makespan for training all jobs by up to 69%, while reducing the average job completion time by up to 83% and the communication overhead by up to 98% under congested networking conditions.
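To make the first component concrete, the following is a minimal Python sketch of classical delay scheduling applied to GPU consolidation: a job waits up to a delay timer for a consolidated (single-node) placement before accepting a fragmented (multi-node) one. The `Job` and `Cluster` types, the `delay_timer` parameter, and the placement helpers are illustrative assumptions for exposition, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Job:
    job_id: int
    num_gpus: int
    skip_count: int = 0   # rounds spent waiting for a consolidated placement

@dataclass
class Cluster:
    free_gpus: Dict[str, int]   # node name -> idle GPU count (hypothetical state)

    def consolidated_fit(self, job: Job) -> Optional[List[str]]:
        # A single node that can host the whole job (no cross-node traffic).
        for node, free in self.free_gpus.items():
            if free >= job.num_gpus:
                return [node]
        return None

    def fragmented_fit(self, job: Job) -> Optional[List[str]]:
        # Spread the job over several nodes (incurs network communication).
        picked, needed = [], job.num_gpus
        for node, free in self.free_gpus.items():
            if free > 0:
                picked.append(node)
                needed -= min(free, needed)
                if needed == 0:
                    return picked
        return None

def delay_schedule(job: Job, cluster: Cluster, delay_timer: int) -> Optional[List[str]]:
    """Classical delay scheduling: wait up to `delay_timer` scheduling rounds
    for a consolidated placement before relaxing to a fragmented one."""
    placement = cluster.consolidated_fit(job)
    if placement is not None:
        return placement                    # preferred: proximity preserved
    if job.skip_count >= delay_timer:
        return cluster.fragmented_fit(job)  # timer expired: relax locality
    job.skip_count += 1                     # skip this round, retry later
    return None
```

In this sketch, a network-sensitive job would be given a larger `delay_timer` (it pays more for fragmentation), which is the knob the paper's "auto-tuner" component is described as optimizing.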