Online Knowledge Distillation (OKD) improves the involved models by reciprocally exploiting the difference between teacher and student. Several crucial bottlenecks concerning the gap between them -- e.g., why and when does a large gap harm the performance, especially the student's? How can the gap between teacher and student be quantified? -- have received limited formal study. In this paper, we propose Switchable Online Knowledge Distillation (SwitOKD) to answer these questions. Instead of focusing on the accuracy gap at test phase, as existing arts do, the core idea of SwitOKD is to adaptively calibrate the gap at training phase, namely the distillation gap, via a switching strategy between two modes -- expert mode (pause the teacher while keeping the student learning) and learning mode (restart the teacher). To maintain an appropriate distillation gap, we further devise an adaptive switching threshold, which provides a formal criterion for when to switch to learning mode or expert mode, and thus improves the student's performance. Meanwhile, the teacher benefits from our adaptive switching threshold and remains basically on a par with other online arts. We further extend SwitOKD to multiple networks with two basis topologies. Finally, extensive experiments and analysis validate the merits of SwitOKD for classification over the state-of-the-art. Our code is available at https://github.com/hfutqian/SwitOKD.
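To make the switching idea concrete, below is a minimal sketch of the mode-selection logic described above. It assumes a simple loss-difference proxy for the distillation gap and a hypothetical threshold value; the paper's actual gap definition and adaptive switching threshold are given in the method section, so the names `distillation_gap`, `switch_mode`, and `threshold` here are illustrative only, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def distillation_gap(teacher_logits, student_logits, labels):
    # Illustrative proxy for the distillation gap: the difference between the
    # student's and the teacher's cross-entropy losses on the current batch.
    # (SwitOKD defines its own gap measure; this is only a stand-in.)
    with torch.no_grad():
        t_loss = F.cross_entropy(teacher_logits, labels)
        s_loss = F.cross_entropy(student_logits, labels)
    return (s_loss - t_loss).item()

def switch_mode(gap, threshold):
    # Expert mode: the gap has grown too large, so pause the teacher's updates
    # while the student keeps learning from it.
    # Learning mode: the gap is acceptable, so restart (keep updating) the teacher.
    return "expert" if gap > threshold else "learning"

# Hypothetical usage inside one training step (threshold assumed fixed here,
# whereas SwitOKD adapts it during training):
if __name__ == "__main__":
    labels = torch.randint(0, 10, (8,))
    teacher_logits = torch.randn(8, 10)
    student_logits = torch.randn(8, 10)
    gap = distillation_gap(teacher_logits, student_logits, labels)
    mode = switch_mode(gap, threshold=0.1)
    # In expert mode, only the student's optimizer would step;
    # in learning mode, both optimizers would step.
    print(mode)
```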