Knowledge distillation (KD) is a promising approach to reducing the computational cost of pre-trained language models (PLMs). Among the various KD approaches, Intermediate Layer Distillation (ILD) has become a de facto standard in NLP owing to its strong performance. In this paper, we find that existing ILD methods are prone to overfitting the training dataset, even though they transfer more information than the original KD. We then present two simple observations that mitigate this overfitting: distilling only the last Transformer layer and performing ILD on supplementary tasks. Building on these two findings, we propose a simple yet effective consistency-regularized ILD (CR-ILD), which prevents the student model from overfitting the training dataset. Extensive experiments on distilling BERT on the GLUE benchmark and several synthetic datasets demonstrate that our proposed ILD method outperforms other KD techniques. Our code is available at https://github.com/jongwooko/CR-ILD.
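To make the two ingredients of the abstract concrete, below is a minimal sketch of what last-layer ILD combined with a consistency-style regularizer might look like. It assumes HuggingFace-style models that return `hidden_states` when called with `output_hidden_states=True`, uses MSE matching with an optional projection to bridge hidden sizes, and approximates the consistency term with two stochastic (dropout) forward passes of the student; the exact CR-ILD objective, layer mapping, and regularizer in the paper may differ.

```python
import torch
import torch.nn.functional as F


def last_layer_ild_loss(student_hidden, teacher_hidden, proj=None):
    """Match only the final Transformer layer's hidden states (MSE).

    `proj` is an optional linear map from the student's hidden size to the
    teacher's, needed when the two widths differ (illustrative assumption).
    """
    if proj is not None:
        student_hidden = proj(student_hidden)
    return F.mse_loss(student_hidden, teacher_hidden)


def cr_ild_loss(student, teacher, input_ids, attention_mask, proj=None, cr_weight=1.0):
    """Illustrative consistency-regularized last-layer ILD objective.

    The ILD term matches the student's last layer to the teacher's last layer;
    the consistency term penalizes divergence between two dropout-perturbed
    forward passes of the student. This is a sketch, not the paper's exact loss.
    """
    with torch.no_grad():
        teacher_hidden = teacher(
            input_ids, attention_mask=attention_mask, output_hidden_states=True
        ).hidden_states[-1]

    # Two forward passes; with dropout active (model.train()), they differ stochastically.
    student_hidden_1 = student(
        input_ids, attention_mask=attention_mask, output_hidden_states=True
    ).hidden_states[-1]
    student_hidden_2 = student(
        input_ids, attention_mask=attention_mask, output_hidden_states=True
    ).hidden_states[-1]

    ild = last_layer_ild_loss(student_hidden_1, teacher_hidden, proj)
    consistency = F.mse_loss(student_hidden_1, student_hidden_2)
    return ild + cr_weight * consistency
```

In a training loop, this loss would typically be added to the task loss (and, optionally, a logit-level KD term), with the student in `train()` mode so the dropout-based consistency term is non-trivial.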