Recent advances in distilling pretrained language models have shown that, besides the expressiveness of knowledge, student-friendliness should be taken into account to realize a truly knowledgeable teacher. Based on a pilot study, we find that over-parameterized teachers can produce expressive yet student-unfriendly knowledge and are thus limited in overall knowledgeableness. To remove the parameters that result in student-unfriendliness, we propose a sparse teacher trick guided by an overall knowledgeable score for each teacher parameter. The knowledgeable score is essentially an interpolation of the expressiveness and student-friendliness scores, so that expressive parameters are retained while student-unfriendly ones are removed. Extensive experiments on the GLUE benchmark show that the proposed sparse teachers can be dense with knowledge and lead to students with compelling performance in comparison with a series of competitive baselines.
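The scoring-and-pruning idea can be illustrated with a minimal sketch. The abstract does not specify how the expressiveness and student-friendliness scores are computed, what interpolation weight is used, or at what granularity parameters are removed, so everything below is an assumption made for illustration: the function names (`knowledgeable_scores`, `keep_mask`, `apply_mask`), the weight `alpha`, the linear form of the interpolation, and the toy score values are all hypothetical.

```python
def knowledgeable_scores(expressiveness, friendliness, alpha=0.5):
    """Interpolate two per-parameter scores.

    Assumption: a linear interpolation with a hypothetical weight `alpha`;
    alpha=0 uses expressiveness only, alpha=1 uses student-friendliness only.
    """
    if len(expressiveness) != len(friendliness):
        raise ValueError("score lists must have the same length")
    return [(1.0 - alpha) * e + alpha * f
            for e, f in zip(expressiveness, friendliness)]


def keep_mask(scores, sparsity):
    """Return a boolean mask that drops the lowest-scoring parameters.

    `sparsity` is the fraction of parameters to remove, in [0, 1).
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_remove = int(sparsity * len(scores))
    # sorted() is stable, so ties are broken by original position.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    removed = set(order[:n_remove])
    return [i not in removed for i in range(len(scores))]


def apply_mask(weights, mask):
    """Zero out the parameters that the mask marks as removed."""
    return [w if keep else 0.0 for w, keep in zip(weights, mask)]


# Toy values (hypothetical): the last parameter is expressive but
# student-unfriendly, so its interpolated score is low.
expressiveness = [0.9, 0.8, 0.1, 0.7]
friendliness = [0.2, 0.9, 0.8, -0.6]
weights = [1.0, -2.0, 0.5, 3.0]

scores = knowledgeable_scores(expressiveness, friendliness, alpha=0.5)
mask = keep_mask(scores, sparsity=0.5)
sparse_weights = apply_mask(weights, mask)
```

With these toy values, the two parameters with the lowest interpolated scores are zeroed out, including the one that is expressive but student-unfriendly. A real teacher would need scores derived from the model and the distillation objective, which this sketch does not attempt.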