Parallel training of neural networks at scale is challenging due to significant overheads arising from communication. Recently, deep learning researchers have developed a variety of pruning algorithms that are capable of pruning (i.e. setting to zero) 80-90% of the parameters in a neural network to yield sparse subnetworks that match the accuracy of the unpruned parent network. In this work, we propose a novel approach that exploits these sparse subnetworks to optimize the memory utilization and communication in two popular algorithms for parallel deep learning, namely data and inter-layer parallelism. We integrate our approach into AxoNN, a highly scalable framework for parallel deep learning that relies on data and inter-layer parallelism, and demonstrate reductions in communication times and memory utilization. On 512 NVIDIA V100 GPUs, our optimizations reduce the memory consumption of a 2.7 billion parameter model by 74% and the total communication times by 40%, thus providing an overall speedup of 34% over AxoNN, 32% over DeepSpeed-3D, and 46% over Sputnik, a sparse matrix computation baseline.
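As a point of reference for the pruning step described above, the following is a minimal sketch of unstructured magnitude pruning, the common baseline for producing such sparse subnetworks. This is a generic illustration in NumPy, not the specific pruning algorithm or AxoNN integration used in this work.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float = 0.9) -> np.ndarray:
    """Return a copy of `weights` with the smallest-magnitude entries zeroed.

    `sparsity` is the fraction of parameters set to zero (e.g. 0.9 keeps
    only the top 10% of weights by absolute value).
    """
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    # Threshold at the k-th smallest magnitude; everything at or below it is pruned.
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

# Example: prune a random 64x64 weight matrix to 90% sparsity.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
sparse_w = magnitude_prune(w, sparsity=0.9)
frac_zero = float((sparse_w == 0).mean())
```

In practice, pruning frameworks store the mask and retrain (or fine-tune) the surviving weights to recover accuracy; the sparse mask is what enables the memory and communication savings discussed in the abstract.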