Training large transformer models is one of the most important computational challenges of modern AI. In this paper, we show how to significantly accelerate training of large transformer models by reducing activation recomputation. Activation recomputation is commonly used to work around memory capacity constraints. Rather than storing activations for backpropagation, they are traditionally recomputed, which saves memory but adds redundant compute. In this work, we show most of this redundant compute is unnecessary because we can reduce memory consumption sufficiently without it. We present two novel yet very simple techniques: sequence parallelism and selective activation recomputation. In conjunction with tensor parallelism, these techniques almost eliminate the need to recompute activations. We evaluate our approach on language models up to one trillion parameters in scale and show that our method reduces activation memory by 5x, while reducing execution time overhead from activation recomputation by over 90%. For example, when training a 530B parameter GPT-3 style model on 2240 NVIDIA A100 GPUs, we achieve a Model Flops Utilization of 54.2%, which is 29% faster than the 42.1% we achieve using recomputation. Our implementation will be available in both Megatron-LM and NeMo-Megatron.
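As a quick check on the headline figure (an illustrative calculation, not taken from the paper's own text): since the two runs use the same model and the same 2240-GPU configuration, the end-to-end speedup is simply the ratio of the two Model FLOPs Utilization values,

\[
\text{speedup} \;=\; \frac{\mathrm{MFU}_{\text{ours}}}{\mathrm{MFU}_{\text{recompute}}} - 1 \;=\; \frac{54.2\%}{42.1\%} - 1 \;\approx\; 0.29,
\]

which matches the stated 29% improvement over full activation recomputation.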