Knowledge distillation (KD), which transfers knowledge from a large teacher model to a small student model, has recently been widely used to compress the BERT model. Beyond the output-level supervision in the original KD, recent works show that layer-level supervision is crucial to the performance of the student BERT model. However, previous works design the layer mapping strategy heuristically (e.g., uniform or last-layer), which can lead to inferior performance. In this paper, we propose to use the genetic algorithm (GA) to search for the optimal layer mapping automatically. To accelerate the search process, we further propose a proxy setting in which a small portion of the training corpus is sampled for distillation and three representative tasks are chosen for evaluation. After obtaining the optimal layer mapping, we perform task-agnostic BERT distillation with it on the whole corpus to build a compact student model, which can be directly fine-tuned on downstream tasks. Comprehensive experiments on the evaluation benchmarks demonstrate that 1) the layer mapping strategy has a significant effect on task-agnostic BERT distillation, and different layer mappings can result in quite different performances; 2) the optimal layer mapping strategy found by the proposed search process consistently outperforms the heuristic ones; 3) with the optimal layer mapping, our student model achieves state-of-the-art performance on the GLUE tasks.
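To make the search procedure concrete, the sketch below is a minimal genetic-algorithm loop in which a candidate layer mapping is a list of teacher-layer indices, one per student layer, evolved by selection, crossover, and mutation. The layer counts, population hyperparameters, the non-decreasing-mapping constraint, and the user-supplied fitness function (which in the proxy setting would distill on the sampled corpus and average the scores on the three representative tasks) are illustrative assumptions, not the paper's exact settings.

```python
import random

# Hypothetical sizes: a 6-layer student distilled from a 12-layer teacher.
NUM_STUDENT_LAYERS = 6
NUM_TEACHER_LAYERS = 12


def random_mapping():
    # A candidate assigns each student layer one teacher layer; sorting keeps
    # the mapping non-decreasing so layer order is preserved (an assumed constraint).
    return sorted(random.randrange(1, NUM_TEACHER_LAYERS + 1)
                  for _ in range(NUM_STUDENT_LAYERS))


def crossover(parent_a, parent_b):
    # Single-point crossover; re-sort to keep the child non-decreasing.
    point = random.randrange(1, NUM_STUDENT_LAYERS)
    return sorted(parent_a[:point] + parent_b[point:])


def mutate(mapping, rate=0.1):
    # Randomly reassign some student layers to new teacher layers.
    child = [random.randrange(1, NUM_TEACHER_LAYERS + 1) if random.random() < rate else t
             for t in mapping]
    return sorted(child)


def search(fitness, pop_size=20, generations=10, elite=4):
    """Evolve layer mappings with a GA.

    `fitness` is supplied by the caller: in the proxy setting it would distill a
    student with the given mapping on the sampled corpus and return the average
    score on the three proxy tasks (placeholder here, not implemented).
    """
    population = [random_mapping() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:elite]  # keep the fittest mappings
        children = [mutate(crossover(*random.sample(parents, 2)))
                    for _ in range(pop_size - elite)]
        population = parents + children
    return max(population, key=fitness)
```

As a design note, evaluating `fitness` is by far the dominant cost, which is why the abstract's proxy setting (a small sampled corpus and three representative tasks) matters: it keeps each generation's distillation runs affordable before the final mapping is used for full-corpus, task-agnostic distillation.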