Existing general-purpose frameworks for gigantic model training, i.e., models with billions to trillions of parameters, cannot scale efficiently on public cloud environments due to large communication overheads. In this paper, we propose MiCS, which Minimizes the Communication Scale to bring down communication overhead. Specifically, by decreasing the number of participants in a communication collective, MiCS can utilize the heterogeneous network bandwidth available on the cloud, reduce network traffic over slower links, and amortize expensive global gradient synchronization overheads. Our evaluation on AWS shows that the system throughput of MiCS is up to 2.89$\times$ that of state-of-the-art large model training systems. MiCS achieves near-linear scaling efficiency, which is up to 1.27$\times$ that of DeepSpeed. MiCS allows us to train a proprietary model with 100 billion parameters on 512 GPUs with 99.4% weak-scaling efficiency, and it is able to saturate over 54.5% of the theoretical computation power of each GPU on a public cloud with less GPU memory and more restricted networks than DGX-A100 clusters.
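To make the core idea of reducing the number of participants in a communication collective concrete, the sketch below shows how gradient synchronization can be restricted to small sub-groups of ranks using PyTorch's distributed API. This is a minimal illustration under assumed settings (NCCL backend, launch via `torchrun`, a hypothetical sub-group size of 8, and helper names `build_subgroups` and `sync_gradients` introduced here for exposition); it is not the MiCS implementation itself.

```python
# Minimal sketch: confine gradient all-reduce to a small sub-group of ranks,
# so each collective involves fewer participants than a global all-reduce.
# Sub-group size and helper names are illustrative assumptions.
import os
import torch
import torch.distributed as dist


def build_subgroups(partition_size: int):
    """Split global ranks into consecutive sub-groups of `partition_size`
    and return the group this rank belongs to."""
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    my_group = None
    for start in range(0, world_size, partition_size):
        ranks = list(range(start, min(start + partition_size, world_size)))
        # Every rank must call new_group for every group, even ones it is not in.
        group = dist.new_group(ranks=ranks)
        if rank in ranks:
            my_group = group
    return my_group


def sync_gradients(model: torch.nn.Module, group) -> None:
    """Average gradients only within the sub-group, instead of across all ranks."""
    group_size = dist.get_world_size(group=group)
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, group=group)
            param.grad.div_(group_size)


if __name__ == "__main__":
    dist.init_process_group(backend="nccl")  # assumes torchrun sets the env vars
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    subgroup = build_subgroups(partition_size=8)  # e.g., one 8-GPU node per sub-group
    model = torch.nn.Linear(1024, 1024).cuda()
    # ... forward and backward pass producing gradients ...
    sync_gradients(model, subgroup)
```

Because each all-reduce in this sketch spans only the ranks within a sub-group (e.g., a single node), it avoids sending traffic over the slower inter-node links on every step; how MiCS partitions state and reconciles updates across sub-groups is described in the body of the paper.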