Parallel training of neural networks at scale is challenging due to significant overheads arising from communication. Recently, deep learning researchers have developed a variety of pruning algorithms that can prune (i.e., set to zero) 80-90% of the parameters in a neural network, yielding sparse subnetworks that match the accuracy of the unpruned parent network. In this work, we propose a novel approach that exploits these sparse subnetworks to optimize memory utilization and communication in two popular algorithms for parallel deep learning, namely data and inter-layer parallelism. We integrate our approach into AxoNN, a highly scalable framework for parallel deep learning that relies on data and inter-layer parallelism, and demonstrate reductions in communication time and memory utilization. On 512 NVIDIA V100 GPUs, our optimizations reduce the memory consumption of a 2.7 billion parameter model by 74% and the total communication time by 40%, providing an overall speedup of 34% over AxoNN, 32% over DeepSpeed-3D, and 46% over Sputnik, a sparse matrix computation baseline.
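The kind of pruning the abstract refers to can be illustrated with a minimal magnitude-pruning sketch in plain NumPy. The function name, the threshold rule, and the 90% sparsity level are illustrative assumptions for exposition; this is not the paper's actual pruning algorithm or AxoNN's implementation.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float = 0.9) -> np.ndarray:
    """Set the `sparsity` fraction of smallest-magnitude weights to zero.

    Returns a dense array with zeros in pruned positions; a real system
    would store it in a sparse format to realize memory/communication savings.
    """
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)  # number of weights to prune
    if k == 0:
        return weights.copy()
    # k-th smallest magnitude serves as the pruning threshold
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold  # keep only larger-magnitude weights
    return weights * mask

# Example: prune 90% of a random weight matrix
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
w_sparse = magnitude_prune(w, sparsity=0.9)
frac_zero = float(np.mean(w_sparse == 0))
```

Roughly 90% of the entries in `w_sparse` are zero; only the surviving 10% of values (plus their indices) would need to be stored or communicated, which is the source of the savings the abstract quantifies.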