Pre-trained models are an important tool in Natural Language Processing (NLP), and BERT is a classic pre-trained model whose structure has been widely adopted by later models; it was even chosen as the reference model for the MLPerf training benchmark. Optimizing the distributed training performance of BERT therefore plays an important role in accelerating solutions to most NLP tasks. BERT models typically take padded tensors as inputs, which leads to excessive redundant computation, so removing this redundant computation is essential for improving distributed training performance. This paper presents a new approach for efficiently training BERT models with variable-length inputs. First, we propose a general structure for variable-length BERT models and accelerate the encoder layer with our grouped multi-stream FMHA (Fused Multi-Head Attention) method. Second, we address the workload imbalance caused by variable-length inputs through data exchange, which is largely overlapped with the training process. Finally, we optimize the overall performance of the BERT model with techniques such as kernel fusion and operator optimization. Our experimental results show that our highly optimized BERT model achieves state-of-the-art throughput and ranks first in MLPerf Training v2.0 under the same GPU configuration. The optimizations in this paper can be applied to more BERT-like models in future work.
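To make the padding-removal idea concrete, the following is a minimal sketch (not the authors' implementation) of how a padded batch of variable-length sequences can be packed into a single flat tensor, with cumulative sequence lengths marking sequence boundaries for a variable-length attention kernel. The helper name `pack_batch` is illustrative.

```python
import torch

def pack_batch(input_ids: torch.Tensor, attention_mask: torch.Tensor):
    """Remove padding tokens from a [batch, max_len] batch.

    Returns the packed 1-D token tensor, per-sequence lengths, and the
    cumulative sequence lengths (prefix sums) that variable-length
    attention kernels use to locate each sequence.
    """
    seq_lens = attention_mask.sum(dim=1)                # real length of each sequence
    packed = input_ids[attention_mask.bool()]           # drop padded positions, row-major order
    cu_seqlens = torch.zeros(len(seq_lens) + 1, dtype=torch.int32)
    cu_seqlens[1:] = torch.cumsum(seq_lens, dim=0)      # sequence boundaries in the packed tensor
    return packed, seq_lens, cu_seqlens

# Example: a batch of 3 sequences padded to length 5 (0 is the padding id).
input_ids = torch.tensor([[101, 7, 8, 102, 0],
                          [101, 9, 102, 0, 0],
                          [101, 5, 6, 4, 102]])
attention_mask = (input_ids != 0).int()
packed, seq_lens, cu_seqlens = pack_batch(input_ids, attention_mask)
# packed holds 4 + 3 + 5 = 12 tokens; cu_seqlens == [0, 4, 7, 12]
```

After packing, the attention computation only touches real tokens, which is the redundant-computation saving the abstract refers to; the per-batch token counts also vary, which is the source of the workload imbalance addressed by data exchange.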