Knowledge distillation is an approach to transfer representational knowledge from a teacher to a student by reducing the difference between their representations. A challenge of this approach is that it restricts the flexibility of the student's representations, which can induce inaccurate learning of the teacher's knowledge. To resolve this during transfer, we investigate distillation of the structures of representations, specified as three types: intra-feature, local inter-feature, and global inter-feature structures. To transfer them, we introduce feature structure distillation methods based on Centered Kernel Alignment (CKA), which assigns a consistent value to similar feature structures and thereby reveals more informative relations. In particular, a memory-augmented transfer method with clustering is implemented for the global structures. The methods are empirically analyzed on the nine language understanding tasks of the GLUE benchmark with Bidirectional Encoder Representations from Transformers (BERT), a representative neural language model. In the results, the proposed methods effectively transfer the three types of structures and improve performance over state-of-the-art distillation methods. The code for the methods is available at https://github.com/maroo-sky/FSD.
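For context on the similarity measure underlying the proposed methods, below is a minimal sketch of linear Centered Kernel Alignment between a student and a teacher feature matrix, written in PyTorch. The function name `linear_cka` and the tensor shapes are illustrative assumptions; this is the standard linear CKA formulation, not necessarily the exact (possibly kernelized or batch-wise) variant used in the released code.

```python
import torch

def linear_cka(student_feats: torch.Tensor, teacher_feats: torch.Tensor) -> torch.Tensor:
    """Linear CKA between two feature matrices of shape (n_samples, dim).

    Returns a scalar in [0, 1]; higher values indicate more similar
    representational structure across the same n samples.
    """
    # Center each feature matrix along the sample dimension.
    X = student_feats - student_feats.mean(dim=0, keepdim=True)
    Y = teacher_feats - teacher_feats.mean(dim=0, keepdim=True)

    # HSIC-based similarity: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = torch.norm(Y.t() @ X, p="fro") ** 2
    denominator = torch.norm(X.t() @ X, p="fro") * torch.norm(Y.t() @ Y, p="fro")
    return numerator / denominator

# Example usage with hypothetical BERT hidden states (batch of 32, hidden size 768):
# similarity = linear_cka(student_hidden, teacher_hidden)
# A distillation loss could then penalize 1 - similarity.
```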