Many deep learning applications benefit from using large models with billions of parameters. Training these models is notoriously expensive due to the need for specialized HPC clusters. In this work, we consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions. We analyze the performance of existing model-parallel algorithms in these conditions and find configurations where training larger models becomes less communication-intensive. Based on these findings, we propose SWARM parallelism, a model-parallel training algorithm designed for poorly connected, heterogeneous and unreliable devices. SWARM creates temporary randomized pipelines between nodes that are rebalanced in case of failure. We empirically validate our findings and compare SWARM parallelism with existing large-scale training approaches. Finally, we combine our insights with compression strategies to train a large Transformer language model with 1B shared parameters (approximately 13B before sharing) on preemptible T4 GPUs with less than 200 Mb/s of network bandwidth.
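The core idea of "temporary randomized pipelines between nodes that are rebalanced in case of failure" can be illustrated with a minimal sketch. The snippet below is not the authors' implementation; it is an assumed toy model in Python in which each pipeline stage is served by several peers, each microbatch is routed to a randomly chosen (throughput-weighted) peer per stage, failed peers are dropped, and idle peers can be reassigned to a bottleneck stage. All names (`stage_peers`, `pick_next_peer`, `drop_peer`, `rebalance`) are hypothetical.

```python
import random

# Hypothetical peer registry: pipeline stage -> list of (peer_id, measured_throughput).
stage_peers = {
    0: [("worker-a", 4.0), ("worker-b", 2.0)],
    1: [("worker-c", 3.0), ("worker-d", 3.0), ("worker-e", 1.0)],
    2: [("worker-f", 5.0)],
}

def pick_next_peer(stage: int) -> str:
    """Choose a peer for the given stage, weighted by throughput, so each
    microbatch may follow a different (temporary, randomized) route."""
    ids, weights = zip(*stage_peers[stage])
    return random.choices(ids, weights=weights, k=1)[0]

def drop_peer(stage: int, peer_id: str) -> None:
    """On failure (e.g., preemption), remove the peer; surviving peers keep
    serving the stage instead of stalling the whole pipeline."""
    stage_peers[stage] = [(p, w) for p, w in stage_peers[stage] if p != peer_id]

def rebalance(src_stage: int, dst_stage: int, peer_id: str, throughput: float) -> None:
    """Move an underutilized peer to a stage that has become a bottleneck."""
    drop_peer(src_stage, peer_id)
    stage_peers[dst_stage].append((peer_id, throughput))

# Route a few microbatches through stages 0 -> 1 -> 2.
for microbatch in range(3):
    route = [pick_next_peer(s) for s in sorted(stage_peers)]
    print(f"microbatch {microbatch}: {' -> '.join(route)}")
```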