Transformer-based NLP models are trained using hundreds of millions or even billions of parameters, limiting their applicability in computationally constrained environments. While the number of parameters generally correlates with performance, it is not clear whether the entire network is required for a downstream task. Motivated by recent work on pruning and distilling pre-trained models, we explore strategies to drop layers from pre-trained models and observe the effect of pruning on downstream GLUE tasks. We were able to prune BERT, RoBERTa, and XLNet models by up to 40% while maintaining up to 98% of their original performance. Additionally, we show that our pruned models are on par with those built using knowledge distillation, both in terms of size and performance. Our experiments yield interesting observations, such as: (i) the lower layers are the most critical for maintaining downstream task performance, (ii) some tasks, such as paraphrase detection and sentence similarity, are more robust to the dropping of layers, and (iii) models trained using different objective functions exhibit different learning patterns with respect to layer dropping.
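As a rough illustration of the layer-dropping setup described above, the sketch below removes the top encoder layers of a pre-trained BERT model before fine-tuning. It is a minimal example assuming the Hugging Face Transformers API (AutoModel, bert-base-uncased); the drop_top_layers helper and the choice of dropping four layers are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch of top-layer dropping, assuming the Hugging Face Transformers API.
# The helper and the number of dropped layers are illustrative choices, not the
# paper's released implementation.
import torch.nn as nn
from transformers import AutoModel

def drop_top_layers(model, num_to_drop):
    """Remove the top `num_to_drop` encoder layers of a BERT-style model in place."""
    layers = model.encoder.layer  # nn.ModuleList of transformer blocks
    model.encoder.layer = nn.ModuleList(layers[: len(layers) - num_to_drop])
    model.config.num_hidden_layers = len(model.encoder.layer)
    return model

model = AutoModel.from_pretrained("bert-base-uncased")  # 12 encoder layers
model = drop_top_layers(model, num_to_drop=4)           # keep the lower 8 layers
# The reduced model is then fine-tuned on a downstream GLUE task as usual.
```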