Conventional wisdom holds that pruning Transformer-based language models reduces model expressiveness and thus makes underfitting more likely than overfitting. However, under the prevailing pretrain-and-finetune paradigm, we postulate a counter-traditional hypothesis: pruning increases the risk of overfitting when performed during the fine-tuning phase. In this paper, we aim to address this overfitting problem and improve pruning performance via progressive knowledge distillation with error-bound properties. We show for the first time that reducing the risk of overfitting improves the effectiveness of pruning under the pretrain-and-finetune paradigm. Ablation studies and experiments on the GLUE benchmark show that our method outperforms the leading competitors across different tasks.