Traditional knowledge distillation adopts a two-stage training process in which a teacher model is pre-trained first and then transfers its knowledge to a compact student model. To overcome this limitation, online knowledge distillation performs one-stage distillation when a pre-trained teacher is unavailable. Recent research on online knowledge distillation has mainly focused on the design of the distillation objective, including attention and gating mechanisms. In this work, we instead focus on the design of the overall architecture and propose Tree-Structured Auxiliary online knowledge distillation (TSA), which hierarchically adds more parallel peers for layers close to the output to strengthen the effect of knowledge distillation. Different branches construct different views of the inputs, which can serve as the source of the knowledge. The hierarchical structure implies that the transferred knowledge shifts from general to task-specific as the layers grow deeper. Extensive experiments on 3 computer vision and 4 natural language processing datasets show that our method achieves state-of-the-art performance without bells and whistles. To the best of our knowledge, we are the first to demonstrate the effectiveness of online knowledge distillation for machine translation tasks.
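To make the architectural idea concrete, the following is a minimal sketch (not the authors' implementation) of online distillation with tree-structured auxiliary peers: lower layers are shared, layers closer to the output are replicated into a growing number of parallel branches, and the peers distill mutually toward their ensemble prediction. The layer sizes, branching factor, routing of heads to mid-level branches, temperature, and loss weight are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TreeAuxiliaryNet(nn.Module):
    """Shared trunk + hierarchically branching auxiliary peers (sketch)."""

    def __init__(self, in_dim=784, hidden=256, num_classes=10, num_peers=4):
        super().__init__()
        # Layers far from the output are kept in a single shared copy.
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        # Intermediate level: fewer parallel copies (assumed 2 sub-branches).
        self.mid = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU()) for _ in range(2)
        )
        # Output level: more parallel peers, so branching grows toward the output.
        self.heads = nn.ModuleList(
            nn.Linear(hidden, num_classes) for _ in range(num_peers)
        )

    def forward(self, x):
        h = self.trunk(x)
        logits = []
        for i, head in enumerate(self.heads):
            # Route each head through one mid-level branch (round-robin assumption).
            m = self.mid[i % len(self.mid)](h)
            logits.append(head(m))
        return logits  # one logit tensor per peer


def online_kd_loss(logits_list, targets, T=3.0, alpha=0.5):
    """Cross-entropy for every peer plus KL distillation toward the peer ensemble,
    a common form of online KD objective (assumed here, not taken from the paper)."""
    ce = sum(F.cross_entropy(z, targets) for z in logits_list)
    ensemble = torch.stack(logits_list).mean(dim=0).detach()
    soft_target = F.softmax(ensemble / T, dim=1)
    kd = sum(
        F.kl_div(F.log_softmax(z / T, dim=1), soft_target, reduction="batchmean") * T * T
        for z in logits_list
    )
    return ce + alpha * kd


# Usage: one training step on random data.
model = TreeAuxiliaryNet()
x, y = torch.randn(8, 784), torch.randint(0, 10, (8,))
loss = online_kd_loss(model(x), y)
loss.backward()
```

The design point illustrated is that duplication is concentrated near the output, so the extra peers add diverse task-specific views while the general low-level features stay shared and cheap.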