The advent of the transformer has sparked a rapid growth in the size of language models, far outpacing hardware improvements. (Dense) transformers are expected to reach the trillion-parameter scale in the near future, at which point training will require thousands or even tens of thousands of GPUs. We investigate the challenges of training at this scale and beyond on commercially available hardware. In particular, we analyse the shortest possible training time for different configurations of distributed training, leveraging empirical scaling laws for language models to estimate the optimal (critical) batch size. Contrary to popular belief, we find no evidence of a memory wall, and instead argue that the real limitation -- other than the cost -- lies in the training duration. In addition to this analysis, we introduce two new methods, \textit{layered gradient accumulation} and \textit{modular pipeline parallelism}, which together cut the shortest training time by half. The methods also reduce data movement, lowering the network requirements to the point where a fast InfiniBand connection is not necessary. This increased network efficiency also allows us to improve on the methods introduced with the ZeRO optimizer, reducing memory usage to a tiny fraction of the available GPU memory.
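As a rough illustration of the batch-size estimate mentioned above, the empirical scaling-law fit of Kaplan et al.\ relates the critical batch size to the current training loss $L$ (in nats). A minimal sketch, where the fitted constants $B_*$ and $\alpha_B$ are taken here as assumptions of roughly the reported magnitudes:
\begin{equation*}
  B_{\mathrm{crit}}(L) \approx \frac{B_*}{L^{1/\alpha_B}},
  \qquad B_* \approx 2 \times 10^{8}~\text{tokens},
  \quad \alpha_B \approx 0.21 .
\end{equation*}
Under these assumed values, a training loss of about 2 nats corresponds to a critical batch size on the order of $10^{7}$ tokens, and the critical batch size grows as the loss falls over the course of training.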