Fueled by advances in distributed deep learning (DDL), recent years have witnessed a rapidly growing demand for resource-intensive distributed/parallel computing to process DDL jobs. To resolve the network communication bottleneck and load-balancing issues in distributed computing, the so-called ``ring-all-reduce'' decentralized architecture has been increasingly adopted to remove the need for dedicated parameter servers. To date, however, there remains a lack of theoretical understanding of how to design resource optimization algorithms for efficiently scheduling ring-all-reduce DDL jobs in computing clusters. This motivates us to fill this gap by proposing a series of new resource scheduling designs for ring-all-reduce DDL jobs. Our contributions in this paper are three-fold: i) We propose a new resource scheduling analytical model for ring-all-reduce deep learning, which covers a wide range of objectives in DDL performance optimization (e.g., excessive training avoidance, energy efficiency, fairness); ii) Based on the proposed performance analytical model, we develop an efficient resource scheduling algorithm called GADGET (greedy ring-all-reduce distributed graph embedding technique), which enjoys a provably strong performance guarantee; iii) We conduct extensive trace-driven experiments to demonstrate the effectiveness of the GADGET approach and its superiority over the state of the art.