When training large models (such as BERT and GPT-3) on hundreds of GPUs, communication has become a major bottleneck, especially on commodity systems with limited-bandwidth TCP networks. On one hand, large-batch optimization techniques such as the LAMB algorithm were proposed to reduce the frequency of communication. On the other hand, communication compression algorithms such as 1-bit Adam help reduce the volume of each communication. However, we find that using either technique alone is not sufficient to solve the communication challenge, especially under low network bandwidth. Motivated by this, we aim to combine the power of large-batch optimization and communication compression, but we find that existing compression strategies cannot be directly applied to LAMB due to its unique adaptive layerwise learning rates. To this end, we design a new communication-efficient algorithm, 1-bit LAMB, which introduces a novel way to support adaptive layerwise learning rates under compression. In addition, we introduce a new system implementation for compressed communication using the NCCL backend of PyTorch distributed, which improves both usability and performance. For the BERT-Large pre-training task with batch sizes from 8K to 64K, our evaluations on up to 256 GPUs demonstrate that 1-bit LAMB with the NCCL-based backend achieves up to 4.6x communication volume reduction, up to 2.8x end-to-end time-wise speedup, and the same sample-wise convergence speed (and the same fine-tuning task accuracy) as uncompressed LAMB.
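As context for how 1-bit compression reduces communication volume, below is a minimal PyTorch sketch of the error-compensated 1-bit compression idea that 1-bit Adam and 1-bit LAMB build on: each element is reduced to its sign plus one shared scaling factor, and the quantization residual is kept locally and fed back into the next step. The function name and interface here are illustrative assumptions, not the paper's actual API.

```python
import torch

def one_bit_compress(tensor: torch.Tensor, error: torch.Tensor):
    """Error-compensated 1-bit compression (illustrative sketch).

    Each worker first adds the residual error carried over from the
    previous step, then transmits only the sign of each element plus a
    single scaling factor, keeping the new residual locally.
    """
    corrected = tensor + error           # error feedback from the last step
    scale = corrected.abs().mean()       # one scalar shared by the whole tensor
    signs = corrected.sign()             # 1 bit per element on the wire
    decompressed = scale * signs         # what the receiver reconstructs
    new_error = corrected - decompressed # residual kept locally, not sent
    return signs, scale, new_error
```

In the actual system, the sign tensor would be bit-packed and exchanged with an allreduce-style collective over the NCCL backend; this sketch only illustrates the per-tensor numerics and the error-feedback bookkeeping.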