Many deep learning applications benefit from using large models with billions of parameters. Training these models is notoriously expensive due to the need for specialized HPC clusters. In this work, we consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions. We analyze the performance of existing model-parallel algorithms in these conditions and find configurations where training larger models becomes less communication-intensive. Based on these findings, we propose SWARM parallelism, a model-parallel training algorithm designed for poorly connected, heterogeneous and unreliable devices. SWARM creates temporary randomized pipelines between nodes that are rebalanced in case of failure. We empirically validate our findings and compare SWARM parallelism with existing large-scale training approaches. Finally, we combine our insights with compression strategies to train a large Transformer language model with 1B shared parameters (approximately 13B before sharing) on preemptible T4 GPUs with less than 200 Mb/s of network bandwidth.
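The core idea of "temporary randomized pipelines between nodes that are rebalanced in case of failure" can be illustrated with a minimal sketch. The snippet below is not the authors' implementation; it is an assumed toy model in Python in which each pipeline stage is served by several peers, each microbatch is routed to a randomly chosen (throughput-weighted) peer per stage, failed peers are dropped, and idle peers can be reassigned to a bottleneck stage. All names (`stage_peers`, `pick_next_peer`, `drop_peer`, `rebalance`) are hypothetical.

```python
import random

# Hypothetical peer registry: pipeline stage -> list of (peer_id, measured_throughput).
stage_peers = {
    0: [("worker-a", 4.0), ("worker-b", 2.0)],
    1: [("worker-c", 3.0), ("worker-d", 3.0), ("worker-e", 1.0)],
    2: [("worker-f", 5.0)],
}

def pick_next_peer(stage: int) -> str:
    """Choose a peer for the given stage, weighted by throughput, so each
    microbatch may follow a different (temporary, randomized) route."""
    ids, weights = zip(*stage_peers[stage])
    return random.choices(ids, weights=weights, k=1)[0]

def drop_peer(stage: int, peer_id: str) -> None:
    """On failure (e.g., preemption), remove the peer; surviving peers keep
    serving the stage instead of stalling the whole pipeline."""
    stage_peers[stage] = [(p, w) for p, w in stage_peers[stage] if p != peer_id]

def rebalance(src_stage: int, dst_stage: int, peer_id: str, throughput: float) -> None:
    """Move an underutilized peer to a stage that has become a bottleneck."""
    drop_peer(src_stage, peer_id)
    stage_peers[dst_stage].append((peer_id, throughput))

# Route a few microbatches through stages 0 -> 1 -> 2.
for microbatch in range(3):
    route = [pick_next_peer(s) for s in sorted(stage_peers)]
    print(f"microbatch {microbatch}: {' -> '.join(route)}")
```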