GPU Cluster Scheduling for Network-Sensitive Deep Learning

We propose a novel GPU-cluster scheduler for distributed deep learning (DDL) workloads that enables proximity-based consolidation of GPU resources according to each DDL job's sensitivity to anticipated communication-network delays. Our scheduler consists of three major components: (i) a classical delay scheduling algorithm to facilitate job placement and consolidation; (ii) a network-sensitive job preemption strategy; and (iii) an "auto-tuner" mechanism that optimizes the delay timers for effective delay scheduling. Additionally, to make large-scale experiments cost-effective, we develop a data-driven DDL cluster simulation platform. Using the simulation platform, we compare our scheduler against several state-of-the-art alternatives on real-world workload traces to demonstrate the benefits of our design. Compared to the prevailing consolidation-based scheduling methods, our scheduler improves the end-to-end makespan for training all jobs by up to 69%, while reducing the average job completion time by up to 83% and the communication overheads by up to 98% under congested networking conditions.
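
To make component (i) concrete, the following is a minimal sketch of the general delay scheduling idea applied to GPU consolidation; it is not the scheduler described above. It assumes three consolidation tiers (machine, rack, cluster) and per-job delay timers. The names `Job`, `machine_delay`, `rack_delay`, `best_available_tier`, and `try_schedule` are hypothetical and chosen only for illustration. A job declines placements looser than the tier its timers currently allow, and relaxes to the next tier once a timer expires.

```python
from dataclasses import dataclass

# Consolidation tiers, ordered from tightest to loosest (hypothetical names).
TIERS = ("machine", "rack", "cluster")


@dataclass
class Job:
    name: str
    gpus: int
    # Delay timers: how long the job waits before giving up on a tier.
    # Hypothetical values; an auto-tuner could adjust them per job.
    machine_delay: float
    rack_delay: float
    waited: float = 0.0

    def allowed_tier(self) -> str:
        """Loosest tier the job accepts after waiting `waited` time units."""
        if self.waited < self.machine_delay:
            return "machine"
        if self.waited < self.machine_delay + self.rack_delay:
            return "rack"
        return "cluster"


def best_available_tier(job, free):
    """Tightest tier at which the free GPUs can host the job, else None.

    `free` maps a rack id to a list of free GPU counts, one per machine.
    """
    if any(g >= job.gpus for machines in free.values() for g in machines):
        return "machine"
    if any(sum(machines) >= job.gpus for machines in free.values()):
        return "rack"
    if sum(sum(machines) for machines in free.values()) >= job.gpus:
        return "cluster"
    return None


def try_schedule(job, free):
    """Return the placement tier, or None if the job should keep waiting."""
    available = best_available_tier(job, free)
    if available is None:
        return None
    # Accept only offers at least as tight as the job currently allows.
    if TIERS.index(available) <= TIERS.index(job.allowed_tier()):
        return available
    return None
```

In this sketch, a 4-GPU job facing a rack with two machines of 2 free GPUs each keeps waiting while its machine-level timer is running, and accepts the rack-level placement once that timer has expired. A network-sensitive job would presumably be given longer timers than an insensitive one, which is the kind of setting the auto-tuner in component (iii) is meant to optimize.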
