In recent years, BERT has made significant breakthroughs on many natural language processing tasks and attracted great attention. Despite its accuracy gains, the BERT model generally involves a huge number of parameters and needs to be trained on massive datasets, so training such a model is computationally challenging and time-consuming. Hence, training efficiency is a critical issue. In this paper, we propose a novel coarse-refined training framework named CoRe to speed up the training of BERT. Specifically, we decompose the training process of BERT into two phases. In the first phase, by introducing a fast attention mechanism and decomposing the large parameter matrices in the feed-forward network sub-layer, we construct a relaxed BERT model that has far fewer parameters and much lower model complexity than the original BERT, so the relaxed model can be trained quickly. In the second phase, we transform the trained relaxed BERT model into the original BERT and further retrain the model. Thanks to the good initialization provided by the relaxed model, the retraining phase requires far fewer training steps than training an original BERT model from scratch with a random initialization. Experimental results show that the proposed CoRe framework greatly reduces the training time without sacrificing performance.
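To make the coarse-refined idea concrete, the sketch below illustrates one plausible reading of the feed-forward decomposition described above; it is not the authors' implementation. It assumes the relaxed model replaces each full FFN weight matrix with a low-rank pair of smaller matrices, and that the refined phase initializes the original FFN from the product of the trained factors. The class names, the rank value, and the expansion mapping are all illustrative assumptions; the fast attention mechanism is not shown.

```python
# Minimal sketch of the coarse-refined FFN idea (hypothetical, not the CoRe code).
import torch
import torch.nn as nn


class RelaxedFFN(nn.Module):
    """Coarse phase: low-rank feed-forward sub-layer with far fewer parameters."""

    def __init__(self, d_model: int = 768, d_ff: int = 3072, rank: int = 128):
        super().__init__()
        self.in_a = nn.Linear(d_model, rank, bias=False)  # factor A of W_in
        self.in_b = nn.Linear(rank, d_ff)                  # factor B of W_in
        self.out_a = nn.Linear(d_ff, rank, bias=False)     # factor A of W_out
        self.out_b = nn.Linear(rank, d_model)              # factor B of W_out
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.act(self.in_b(self.in_a(x)))
        return self.out_b(self.out_a(h))


class FullFFN(nn.Module):
    """Refined phase: the original full-rank feed-forward sub-layer."""

    def __init__(self, d_model: int = 768, d_ff: int = 3072):
        super().__init__()
        self.fc_in = nn.Linear(d_model, d_ff)
        self.fc_out = nn.Linear(d_ff, d_model)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc_out(self.act(self.fc_in(x)))


def expand_ffn(relaxed: RelaxedFFN, full: FullFFN) -> None:
    """Initialize the full FFN from the trained low-rank factors (assumed mapping)."""
    with torch.no_grad():
        # In nn.Linear convention, fc_in.weight has shape (d_ff, d_model) = B @ A.
        full.fc_in.weight.copy_(relaxed.in_b.weight @ relaxed.in_a.weight)
        full.fc_in.bias.copy_(relaxed.in_b.bias)
        full.fc_out.weight.copy_(relaxed.out_b.weight @ relaxed.out_a.weight)
        full.fc_out.bias.copy_(relaxed.out_b.bias)


if __name__ == "__main__":
    relaxed, full = RelaxedFFN(), FullFFN()
    expand_ffn(relaxed, full)
    x = torch.randn(2, 16, 768)
    # After expansion the full FFN reproduces the relaxed FFN's outputs exactly,
    # so the refined phase starts from a warm initialization rather than random weights.
    print(torch.allclose(relaxed(x), full(x), atol=1e-5))
```

Under these assumptions, the relaxed sub-layer has roughly 1M parameters versus about 4.7M for the full one (with rank 128), which is the kind of reduction that lets the coarse phase train quickly before the expanded model is retrained.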