Large pre-trained language models have achieved state-of-the-art results on a variety of downstream tasks. Knowledge Distillation (KD) of a smaller student model addresses their inefficiency, allowing for deployment in resource-constrained environments. KD, however, remains ineffective, as the student is manually selected from a set of existing options already pre-trained on large corpora, a sub-optimal choice within the space of all possible student architectures. This paper proposes KD-NAS, the use of Neural Architecture Search (NAS) guided by the Knowledge Distillation process to find the optimal student model for distillation from a teacher, for a given natural language task. In each episode of the search process, a NAS controller predicts a reward based on a combination of accuracy on the downstream task and latency of inference. The top candidate architectures are then distilled from the teacher on a small proxy set. Finally, the architecture(s) with the highest reward are selected and distilled on the full downstream-task training set. When distilling on the MNLI task, our KD-NAS model produces a 2-point improvement in accuracy on GLUE tasks at equivalent GPU latency with respect to a hand-crafted student architecture available in the literature. Using Knowledge Distillation, this model also achieves a 1.4x speedup in GPU latency (3.2x speedup on CPU) with respect to a BERT-Base teacher, while maintaining 97% performance on GLUE tasks (without CoLA). We also obtain an architecture with performance equivalent to the hand-crafted student model on the GLUE benchmark, but with a 15% speedup in GPU latency (20% speedup in CPU latency) and 0.8 times the number of parameters.
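To make the search loop concrete, the sketch below illustrates the episodic procedure the abstract describes: a controller scores candidate student architectures, the top candidates are distilled on a small proxy set, and the highest-reward architecture is kept for full distillation. This is a minimal illustration only; the function names, the architecture search space, and the reward formula (accuracy minus a weighted latency term) are assumptions for exposition, not the paper's implementation.

```python
# Minimal sketch of a KD-NAS-style search loop (illustrative assumptions only).
# All helpers (sample_architectures, predict_reward, distill_on_proxy,
# measure_latency) are hypothetical stubs, not the paper's actual code.

import random

LATENCY_WEIGHT = 0.5  # assumed trade-off weight between accuracy and latency


def sample_architectures(n):
    """Sample n candidate student architectures (layers, hidden size, heads)."""
    return [
        {"layers": random.choice([4, 6, 8]),
         "hidden": random.choice([256, 384, 512]),
         "heads": random.choice([4, 8, 12])}
        for _ in range(n)
    ]


def predict_reward(arch):
    """Controller's predicted reward for a candidate (stubbed as random)."""
    return random.random()


def distill_on_proxy(arch):
    """Distill the candidate from the teacher on a small proxy set;
    return proxy-task accuracy (stubbed)."""
    return random.uniform(0.6, 0.9)


def measure_latency(arch):
    """Relative inference latency of the candidate versus a 12x768 teacher."""
    return arch["layers"] * arch["hidden"] / (12 * 768)


def kd_nas_search(episodes=3, candidates_per_episode=8, top_k=2):
    """Run the episodic search and return the highest-reward architecture."""
    best_arch, best_reward = None, float("-inf")
    for _ in range(episodes):
        # 1. Controller ranks sampled candidates by predicted reward.
        pool = sorted(sample_architectures(candidates_per_episode),
                      key=predict_reward, reverse=True)
        # 2. Top candidates are distilled from the teacher on the proxy set,
        #    and an observed reward combines accuracy and latency.
        for arch in pool[:top_k]:
            reward = distill_on_proxy(arch) - LATENCY_WEIGHT * measure_latency(arch)
            if reward > best_reward:
                best_arch, best_reward = arch, reward
        # (A real controller would be updated here with the observed rewards.)
    return best_arch


if __name__ == "__main__":
    # 3. The selected architecture would then be distilled on the full training set.
    print("Selected student architecture:", kd_nas_search())
```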