Large deep learning models have shown great potential for delivering exceptional results in a wide range of applications. However, training them can be extremely challenging because of their vast parameter counts, often reaching hundreds of billions of parameters. Common distributed training methods, such as data parallelism, tensor parallelism, and pipeline parallelism, require substantial data communication throughout the process, leading to prolonged wait times for some machines in geographically dispersed distributed systems. To address this issue, we propose a novel solution called Hulk, which utilizes a modified graph neural network to optimize distributed computing systems. Hulk not only improves the efficiency of data communication between machines in different countries, or in different regions of the same city, but also produces an optimized parallel deployment of the model. For example, it can place certain layers on a machine in a specific region, or route specific model parameters to a machine in a particular location. In our experiments, Hulk improved the time efficiency of training large deep learning models on distributed systems by more than 20\%. Our open-source collection of unlabeled data is available at: https://github.com/DLYuanGod/Hulk.