We propose a simple and efficient approach to training the BERT model. Our approach exploits the special structure of BERT, which consists of a stack of repeated modules (i.e., Transformer encoders). We first train BERT with the weights shared across all repeated modules up to a certain point, so that the model learns the component of the weights that is common to all repeated layers. We then stop weight sharing and continue training until convergence. We present theoretical insights into training by sharing weights and then unsharing them, supported by an analysis of simplified models. Empirical experiments on BERT show that our method yields better-performing trained models and significantly reduces the number of training iterations.
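The following is a minimal sketch, not the authors' implementation, of the share-then-unshare training schedule on a toy stack of Transformer encoder layers in PyTorch. The layer sizes, dummy objective, optimizer, and the switch point `T_share` are illustrative assumptions.

```python
# Sketch of the share-then-unshare schedule (illustrative only).
import copy
import torch
import torch.nn as nn

class StackedEncoder(nn.Module):
    def __init__(self, num_layers=12, d_model=768, nhead=12, shared=True):
        super().__init__()
        make = lambda: nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        if shared:
            layer = make()
            # Every position references the same module, so one set of
            # weights is reused by all repeated layers.
            self.layers = nn.ModuleList([layer] * num_layers)
        else:
            self.layers = nn.ModuleList([make() for _ in range(num_layers)])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

def unshare(model):
    """Replace the shared layer with independent deep copies so that
    subsequent gradient updates are no longer tied across layers."""
    model.layers = nn.ModuleList([copy.deepcopy(l) for l in model.layers])
    return model

# Phase 1: train with shared weights until step T_share (assumed hyperparameter).
# Phase 2: unshare and continue training until convergence.
model = StackedEncoder(shared=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
T_share = 10_000  # illustrative switch point

for step in range(20_000):                 # placeholder training loop
    x = torch.randn(8, 128, 768)           # dummy input batch
    loss = model(x).pow(2).mean()          # dummy objective (stands in for MLM loss)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step + 1 == T_share:
        model = unshare(model)
        # Rebuild the optimizer so it tracks the now-independent parameters.
        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```

In this sketch, phase 1 updates a single set of encoder weights that receives gradients from every layer position, and `unshare` copies those weights into independent modules so phase 2 can fine-tune each layer separately.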