Knowledge distillation is an approach to transfer the information in representations from a teacher to a student by reducing the difference between them. A challenge of this approach is that reducing this difference constrains the flexibility of the student's representations, which can induce inaccurate learning of the teacher's knowledge. To resolve this in BERT transferring, we investigate the distillation of representation structures, categorized into three types: intra-feature, local inter-feature, and global inter-feature structures. To transfer them, we introduce \textit{feature structure distillation} methods based on Centered Kernel Alignment (CKA), which assigns a consistent value to similar feature structures and reveals more informative relations. In particular, a memory-augmented transfer method with clustering is implemented for the global structures. In experiments on the nine language understanding tasks of the GLUE benchmark, the proposed methods effectively transfer the three types of structures and improve performance over state-of-the-art distillation methods. The code for the methods is available at https://github.com/maroo-sky/FSD
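As a rough illustration of the similarity measure underlying these methods, the following is a minimal sketch of linear CKA between a teacher and a student feature matrix. The function name \texttt{linear\_cka} and the NumPy formulation are illustrative only; the paper's actual loss terms and the memory-augmented variant are defined in the repository above.

\begin{verbatim}
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two feature matrices.

    X: (n, d_teacher) teacher features; Y: (n, d_student) student features,
    both over the same n examples. Returns a similarity in [0, 1].
    """
    # Center each feature dimension over the batch.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # HSIC-style cross-similarity and self-similarity terms.
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    self_x = np.linalg.norm(X.T @ X, ord="fro")
    self_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return cross / (self_x * self_y)
\end{verbatim}

Because CKA is invariant to orthogonal transformations and isotropic scaling of the features, similar feature structures receive consistent values even when the teacher and student have different hidden dimensions, which is the property the abstract highlights.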