Scalable training of large models (like BERT and GPT-3) requires careful optimization rooted in model design, architecture, and system capabilities. From a system standpoint, communication has become a major bottleneck, especially on commodity systems with standard TCP interconnects that offer limited network bandwidth. Communication compression is an important technique to reduce training time on such systems. One of the most effective methods is error-compensated compression, which offers robust convergence speed even under 1-bit compression. However, state-of-the-art error compensation techniques only work with basic optimizers like SGD and momentum SGD, which are linearly dependent on the gradients. They do not work with non-linear gradient-based optimizers like Adam, which offer state-of-the-art convergence efficiency and accuracy for models like BERT. In this paper, we propose 1-bit Adam that reduces the communication volume by up to $5\times$, offers much better scalability, and provides the same convergence speed as uncompressed Adam. Our key finding is that Adam's variance (non-linear term) becomes stable (after a warmup phase) and can be used as a fixed precondition for the rest of the training (compression phase). Experiments on up to 256 GPUs show that 1-bit Adam enables up to $3.3\times$ higher throughput for BERT-Large pre-training and up to $2.9\times$ higher throughput for SQuAD fine-tuning. In addition, we provide theoretical analysis for our proposed work.
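To make the two-phase idea concrete, below is a minimal single-process sketch, assuming NumPy and a caller-supplied `grad_fn(w)` (both hypothetical here). The real 1-bit Adam compresses the communicated momentum across workers with error compensation on both the worker and server sides; this sketch only illustrates the warmup phase (plain Adam), the frozen variance used as a fixed precondition, and local 1-bit compression with an error-feedback buffer.

```python
import numpy as np

def onebit_adam_sketch(w, grad_fn, steps, warmup_steps,
                       lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Illustrative (non-distributed) sketch of the 1-bit Adam idea."""
    m = np.zeros_like(w)          # first moment (momentum)
    v = np.zeros_like(w)          # second moment (variance)
    error = np.zeros_like(w)      # error-compensation buffer

    for t in range(1, steps + 1):
        g = grad_fn(w)

        if t <= warmup_steps:
            # Warmup phase: uncompressed Adam; the variance v is still adapting.
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g * g
            w = w - lr * m / (np.sqrt(v) + eps)
        else:
            # Compression phase: v is frozen and acts as a fixed precondition.
            m = beta1 * m + (1 - beta1) * g
            compensated = m + error
            # 1-bit compression: keep only the sign, rescaled to preserve magnitude.
            scale = np.sum(np.abs(compensated)) / compensated.size
            m_compressed = scale * np.sign(compensated)
            error = compensated - m_compressed   # remember what compression dropped
            # In the distributed setting, m_compressed is what gets communicated.
            w = w - lr * m_compressed / (np.sqrt(v) + eps)
    return w

# Toy usage: minimize a simple quadratic.
w0 = np.random.randn(4)
w_final = onebit_adam_sketch(w0, grad_fn=lambda w: 2 * w,
                             steps=2000, warmup_steps=200)
```

The design choice this illustrates is that once `v` stops changing, the per-step update becomes linear in the momentum, which is what makes error-compensated 1-bit compression applicable during the compression phase.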