MixKD: Towards Efficient Distillation of Large-scale Language Models

Large-scale language models have recently demonstrated impressive empirical performance. Nevertheless, the improved results come at the price of larger models, higher power consumption, and slower inference, which hinder their applicability to platforms with limited memory and computation. Knowledge distillation (KD) has been demonstrated to be an effective framework for compressing such large models. However, large-scale neural network systems are prone to memorizing training instances, and thus tend to make inconsistent predictions when the data distribution is altered slightly. Moreover, the student model has few opportunities to request useful information from the teacher model when only limited task-specific data is available. To address these issues, we propose MixKD, a data-agnostic distillation framework that leverages mixup, a simple yet efficient data augmentation approach, to endow the resulting model with stronger generalization ability. Concretely, in addition to the original training examples, the student model is encouraged to mimic the teacher's behavior on linear interpolations of example pairs as well. We prove from a theoretical perspective that, under reasonable conditions, MixKD gives rise to a smaller gap between the generalization error and the empirical error. To verify its effectiveness, we conduct experiments on the GLUE benchmark, where MixKD consistently leads to significant gains over standard KD training and outperforms several competitive baselines. Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
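The core idea, having the student mimic the teacher on linear interpolations of example pairs as well as on the original examples, can be illustrated with a small NumPy sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the toy linear "teacher" and "student", the choice to interpolate continuous input vectors (for example, embeddings), the Beta(0.4, 0.4) mixing coefficient, the soft-target cross-entropy as the distillation loss, and the helper names `mixup` and `kd_loss`.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mixup(x, lam, perm):
    # Linear interpolation of each example with a partner chosen by `perm`.
    return lam * x + (1.0 - lam) * x[perm]

def kd_loss(student_logits, teacher_logits):
    # Cross-entropy of the student against the teacher's soft predictions
    # (one common choice of distillation loss; assumed here).
    p_teacher = softmax(teacher_logits)
    log_p_student = np.log(softmax(student_logits) + 1e-12)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))          # toy batch: 8 examples, 16-dim inputs
W_teacher = rng.normal(size=(16, 3))  # hypothetical linear teacher, 3 classes
W_student = rng.normal(size=(16, 3))  # hypothetical linear student

lam = rng.beta(0.4, 0.4)              # mixing coefficient (assumed hyperparameter)
perm = rng.permutation(len(x))        # random pairing of examples in the batch
x_mix = mixup(x, lam, perm)

# The student is asked to match the teacher on both the original
# and the interpolated inputs.
loss_orig = kd_loss(x @ W_student, x @ W_teacher)
loss_mix = kd_loss(x_mix @ W_student, x_mix @ W_teacher)
total_loss = loss_orig + loss_mix
print(f"original: {loss_orig:.4f}  mixed: {loss_mix:.4f}  total: {total_loss:.4f}")
```

Because the teacher is queried on the interpolated inputs, each mixed batch gives the student additional supervision signals without requiring any new labeled data, which is the motivation the abstract gives for the limited-data setting.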
