Heavily overparameterized language models such as BERT, XLNet and T5 have achieved impressive success in many NLP tasks. However, their high model complexity requires enormous computation resources and extremely long training time for both pre-training and fine-tuning. Many works have studied model compression on large NLP models; however, they focus only on reducing inference time and still require an expensive training process. Other works use extremely large batch sizes to shorten the pre-training time, at the expense of higher demand on computational resources. In this paper, inspired by the Early-Bird Lottery Tickets recently studied for computer vision tasks, we propose EarlyBERT, a general computationally-efficient training algorithm applicable to both pre-training and fine-tuning of large-scale language models. By slimming the self-attention and fully-connected sub-layers inside a transformer, we are the first to identify structured winning tickets in the early stage of BERT training. We apply these tickets for efficient BERT training, and conduct comprehensive pre-training and fine-tuning experiments on GLUE and SQuAD downstream tasks. Our results show that EarlyBERT achieves comparable performance to standard BERT, with 35-45% less training time. Code is available at https://github.com/VITA-Group/EarlyBERT.
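The "slimming" mentioned above follows the network-slimming idea at a high level: learnable coefficients are attached to each self-attention head and each intermediate neuron of the fully-connected sub-layer, trained jointly with an L1 penalty, and the lowest-magnitude coefficients are pruned to form a structured early-bird ticket. Below is a minimal PyTorch-style sketch of that idea, assuming standard BERT-base dimensions; the class and parameter names (SlimmedSelfAttention, head_coef, ffn_coef, prune_ratio) are illustrative and not taken from the released EarlyBERT code.

```python
# Sketch only: structured slimming coefficients on attention heads and FFN neurons,
# with an L1 penalty and magnitude-based pruning to extract an early-bird ticket.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SlimmedSelfAttention(nn.Module):
    """Self-attention with one learnable slimming coefficient per head (illustrative)."""

    def __init__(self, hidden=768, heads=12):
        super().__init__()
        self.heads, self.d = heads, hidden // heads
        self.qkv = nn.Linear(hidden, 3 * hidden)
        self.out = nn.Linear(hidden, hidden)
        self.head_coef = nn.Parameter(torch.ones(heads))  # one coefficient per head

    def forward(self, x):
        b, t, h = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.heads, self.d).transpose(1, 2) for z in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d ** 0.5, dim=-1)
        heads_out = attn @ v                                   # (b, heads, t, d)
        heads_out = heads_out * self.head_coef.view(1, -1, 1, 1)  # scale each head
        return self.out(heads_out.transpose(1, 2).reshape(b, t, h))


class SlimmedFFN(nn.Module):
    """Feed-forward sub-layer with one coefficient per intermediate neuron (illustrative)."""

    def __init__(self, hidden=768, ffn_dim=3072):
        super().__init__()
        self.fc1 = nn.Linear(hidden, ffn_dim)
        self.fc2 = nn.Linear(ffn_dim, hidden)
        self.ffn_coef = nn.Parameter(torch.ones(ffn_dim))  # one coefficient per neuron

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)) * self.ffn_coef)


def l1_penalty(model, lam=1e-4):
    """L1 regularization on the slimming coefficients, added to the task loss."""
    return lam * sum(p.abs().sum()
                     for n, p in model.named_parameters() if n.endswith("coef"))


def ticket_masks(model, prune_ratio=0.4):
    """Keep the (1 - prune_ratio) fraction of heads/neurons with the largest coefficients."""
    masks = {}
    for n, p in model.named_parameters():
        if n.endswith("coef"):
            k = max(1, int(p.numel() * (1 - prune_ratio)))
            thresh = p.detach().abs().topk(k).values.min()
            masks[n] = (p.detach().abs() >= thresh).float()
    return masks
```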