Powered by advances in deep learning (DL) techniques, machine learning and artificial intelligence have achieved astonishing successes. However, the rapidly growing demand for DL has also led to communication- and resource-intensive distributed training jobs for large-scale DL training, which are typically deployed over GPU clusters. To sustain the ever-increasing demand for DL training, the so-called "ring-all-reduce" (RAR) technologies have recently emerged as a favorable computing architecture for efficiently processing the network communication and computation loads in GPU clusters. The most salient feature of RAR is that it removes the need for dedicated parameter servers, thus alleviating the potential communication bottleneck. However, when multiple RAR-based DL training jobs are deployed over GPU clusters, communication bottlenecks could still occur due to contention among DL training jobs. So far, there remains a lack of theoretical understanding of how to design contention-aware resource scheduling algorithms for RAR-based DL training jobs, which motivates us to fill this gap in this work. Our main contributions are three-fold: i) we develop a new analytical model that characterizes both the communication overhead related to the worker distribution of a job and the communication contention related to the co-location of different jobs; ii) based on the proposed analytical model, we formulate the problem as a non-convex integer program to minimize the makespan of all RAR-based DL training jobs. To address the unique structure of this problem, which is not amenable to optimization algorithm design, we reformulate it into an integer linear program that enables the design of an approximation algorithm with provable guarantees, called SJF-BCO (Smallest Job First with Balanced Contention and Overhead); and iii) we conduct extensive experiments to show the superiority of SJF-BCO over existing schedulers.
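The core RAR mechanism the abstract refers to can be illustrated with a minimal single-process sketch. This is an assumption-laden toy simulation (not the paper's algorithm or any real collective-communication library): each of the N workers splits its gradient into N chunks and, over 2(N-1) ring steps (a scatter-reduce phase followed by an all-gather phase), every worker ends up with the full element-wise sum while exchanging only chunk-sized messages with its ring neighbor — this is precisely what removes the need for a dedicated parameter server.

```python
import numpy as np

def ring_all_reduce(tensors):
    """Toy simulation of ring-all-reduce: each of the n workers ends up
    holding the element-wise sum of all n input tensors, exchanging only
    2*(n-1) chunk-sized messages per worker (no parameter server)."""
    n = len(tensors)
    # Each worker splits its local tensor into n chunks (one per ring slot).
    chunks = [np.array_split(t.astype(float), n) for t in tensors]

    # Phase 1: scatter-reduce. In step s, worker i sends chunk (i - s) mod n
    # to its neighbor (i + 1) mod n, which accumulates it into its own copy.
    # (Each worker sends a different chunk than it receives, so sequential
    # in-place updates match the simultaneous exchange.)
    for s in range(n - 1):
        for i in range(n):
            c = (i - s) % n
            chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]

    # After phase 1, worker i holds the fully reduced chunk (i + 1) mod n.
    # Phase 2: all-gather. Circulate the reduced chunks around the ring so
    # every worker obtains all n reduced chunks.
    for s in range(n - 1):
        for i in range(n):
            c = (i + 1 - s) % n
            chunks[(i + 1) % n][c] = chunks[i][c]

    return [np.concatenate(ch) for ch in chunks]
```

Under this model, each worker transmits roughly 2(N-1)/N of the tensor size in total regardless of N, which is why the per-worker bandwidth cost stays nearly constant as the ring grows — but also why contention arises when several rings share the same links, the scenario the scheduling problem above targets.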