Knowledge Distillation (KD) has been widely used for natural language understanding (NLU) tasks to improve a small model's (the student's) generalization by transferring knowledge from a larger model (the teacher). Although KD methods achieve state-of-the-art results in numerous settings, they suffer from several problems that limit their effectiveness. The literature shows that a large capacity gap between the teacher and student networks can render KD ineffective. Additionally, existing KD techniques do not mitigate the noise in the teacher's output: modeling the teacher's noisy behaviour can distract the student from learning more useful features. We propose a new KD method that addresses these problems and eases training compared to previous techniques. Inspired by continuation optimization, we design a training procedure that optimizes the highly non-convex KD objective by starting with a smoothed version of this objective and gradually increasing its complexity as training proceeds. Our method (Continuation-KD) achieves state-of-the-art performance across various compact architectures on NLU (the GLUE benchmark) and computer vision tasks (CIFAR-10 and CIFAR-100).
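The sketch below is an illustrative, hedged rendering of the continuation idea described above, not the paper's exact Continuation-KD objective: it blends a heavily smoothed (high-temperature) KD term with the hard-label loss and anneals toward the full objective as training proceeds. The function name `continuation_kd_loss`, the linear annealing coefficient, and the temperature schedule (`T_start`, `T_end`) are assumptions made for illustration.

```python
# Illustrative sketch only: NOT the authors' exact Continuation-KD formulation.
# It starts from a smoothed objective and anneals toward the full one.
import torch
import torch.nn.functional as F

def continuation_kd_loss(student_logits, teacher_logits, labels,
                         step, total_steps,
                         T_start=8.0, T_end=1.0):
    """Blend a smoothed KD term with the hard-label loss (assumed schedule).

    Early in training the temperature is high (smooth teacher distribution)
    and the KD term dominates; as training proceeds the temperature is
    annealed down and the hard-label cross-entropy receives more weight.
    """
    # Annealing coefficient in [0, 1]: 0 at the start, 1 at the end.
    alpha = min(step / max(total_steps, 1), 1.0)

    # Temperature schedule: start smooth, end sharp (assumption).
    T = T_start + alpha * (T_end - T_start)

    # Soft-target KD term, scaled by T^2 as in standard temperature-based KD.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    # Hard-label cross-entropy term.
    ce = F.cross_entropy(student_logits, labels)

    # Shift weight from the smoothed KD term to the full objective over time.
    return (1.0 - alpha) * kd + alpha * ce
```

In a training loop, `step` would be the current optimization step and `total_steps` the planned training length, so the loss surface the student sees is progressively "un-smoothed" rather than fixed from the start.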