Fine-tuning transformer models after unsupervised pre-training achieves very high performance on many different NLP tasks. Unfortunately, transformers suffer from long inference times, which greatly increase costs in production and are a limiting factor for deployment on embedded devices. One possible solution is knowledge distillation, which transfers information from a large teacher model to a smaller student model. However, because it requires an additional expensive pre-training phase, distillation is computationally costly and can be financially prohibitive for smaller academic research groups. Another solution is layer-wise pruning, which reaches high compression rates for transformer models and avoids the computational load of the pre-training distillation stage. The price to pay is that the performance of layer-wise pruning algorithms is not on par with state-of-the-art knowledge distillation methods. In this paper, greedy layer pruning (GLP) is introduced to (1) outperform the current state of the art for layer-wise pruning, (2) close the performance gap to knowledge distillation, while (3) using only a modest budget. More precisely, with the methodology presented it is possible to prune and evaluate competitive models on the whole GLUE benchmark with a budget of just $\$300$. Our source code is available at https://github.com/deepopinion/greedy-layer-pruning.
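To make the greedy selection idea concrete, the following is a minimal sketch of one plausible layer-selection loop: at each step, every remaining encoder layer is tentatively removed, the pruned model is scored on the downstream task, and the single removal that hurts performance least is committed. Here `finetune_and_eval` is a hypothetical caller-supplied helper (not part of the paper's codebase) that fine-tunes a model restricted to the given layers and returns a validation score; the sketch illustrates the greedy loop only, not the exact implementation in the linked repository.

```python
def greedy_layer_pruning(layers, task, n_layers_to_prune, finetune_and_eval):
    """Greedily remove encoder layers one at a time.

    At each step, try removing each remaining layer, score the pruned
    model on the downstream task, and permanently drop the layer whose
    removal degrades validation performance the least.
    """
    remaining = list(layers)
    for _ in range(n_layers_to_prune):
        best_score, best_idx = float("-inf"), None
        for i in range(len(remaining)):
            # Candidate model with layer i removed.
            candidate = remaining[:i] + remaining[i + 1:]
            score = finetune_and_eval(candidate, task)
            if score > best_score:
                best_score, best_idx = score, i
        # Greedily commit to the best single-layer removal.
        remaining.pop(best_idx)
    return remaining
```

Note that each pruning step requires fine-tuning and evaluating one candidate per remaining layer, which is what keeps the overall budget bounded (no pre-training stage is involved), at the cost of the greedy choice being locally rather than globally optimal.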