The advent of large pre-trained language models has given rise to rapid progress in the field of Natural Language Processing (NLP). While the performance of these models on standard benchmarks has scaled with size, compression techniques such as knowledge distillation have been key in making them practical. We present MATE-KD, a novel text-based adversarial training algorithm which improves the performance of knowledge distillation. MATE-KD first trains a masked language model-based generator to perturb text by maximizing the divergence between teacher and student logits. Then, using knowledge distillation, a student is trained on both the original and the perturbed training samples. We evaluate our algorithm, using BERT-based models, on the GLUE benchmark and demonstrate that MATE-KD outperforms competitive adversarial learning and data augmentation baselines. On the GLUE test set, our 6-layer RoBERTa-based model outperforms BERT-Large.
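To make the two alternating steps described above concrete, the following is a minimal, self-contained PyTorch sketch of one MATE-KD-style training iteration. It is illustrative only: `TinyClassifier` and `TinyGenerator` stand in for the BERT-based teacher, student, and masked language model generator, and the mask fraction, loss weighting, and use of a straight-through Gumbel-softmax over one-hot inputs are assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HID, CLASSES = 100, 32, 2   # toy sizes; the real models are BERT-scale

class TinyClassifier(nn.Module):
    """Stand-in for a BERT-style teacher/student. It consumes (soft) one-hot
    token vectors so that differentiable perturbations can flow through."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, HID)
        self.head = nn.Linear(HID, CLASSES)

    def forward(self, one_hot):                    # (B, T, VOCAB)
        h = one_hot @ self.emb.weight              # (B, T, HID)
        return self.head(h.mean(dim=1))            # (B, CLASSES) logits

class TinyGenerator(nn.Module):
    """Stand-in for the masked-LM generator that proposes replacement tokens."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, HID)
        self.lm_head = nn.Linear(HID, VOCAB)

    def forward(self, one_hot):
        return self.lm_head(one_hot @ self.emb.weight)   # per-token vocab logits

def kl_div(p_logits, q_logits):
    """KL(p || q) between two sets of logits."""
    return F.kl_div(F.log_softmax(q_logits, dim=-1),
                    F.softmax(p_logits, dim=-1), reduction="batchmean")

teacher, student, generator = TinyClassifier(), TinyClassifier(), TinyGenerator()
teacher.requires_grad_(False)                      # the teacher stays frozen
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_s = torch.optim.Adam(student.parameters(), lr=1e-3)

def perturb(one_hot, mask_frac=0.3):
    """Replace a random subset of positions with generator samples; the
    straight-through Gumbel-softmax keeps the sampling step differentiable."""
    gen_logits = generator(one_hot)
    samples = F.gumbel_softmax(gen_logits, tau=1.0, hard=True)
    mask = (torch.rand(one_hot.shape[:2]) < mask_frac).unsqueeze(-1).float()
    return mask * samples + (1.0 - mask) * one_hot

def train_step(token_ids, labels):
    x = F.one_hot(token_ids, VOCAB).float()

    # Step 1: the generator ascends the teacher-student divergence on perturbed
    # text (gradient ascent via the negated KL). Any gradients that reach the
    # student here are cleared before its own update below.
    x_adv = perturb(x)
    g_loss = -kl_div(teacher(x_adv), student(x_adv))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

    # Step 2: the student is distilled on both the original and perturbed samples.
    x_adv = perturb(x).detach()
    with torch.no_grad():
        t_orig, t_adv = teacher(x), teacher(x_adv)
    s_loss = (F.cross_entropy(student(x), labels)
              + kl_div(t_orig, student(x))
              + kl_div(t_adv, student(x_adv)))
    opt_s.zero_grad()
    s_loss.backward()
    opt_s.step()
    return s_loss.item()

# Toy usage: a batch of 4 sequences of length 8 with binary labels.
tokens = torch.randint(0, VOCAB, (4, 8))
labels = torch.randint(0, CLASSES, (4,))
print(train_step(tokens, labels))
```

Feeding soft one-hot vectors through the embedding matrix is what gives the generator a gradient path from the teacher-student divergence; with discrete token ids the adversarial maximization step would have nothing to differentiate through.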