Deep neural networks often have a huge number of parameters, which poses challenges for deployment in application scenarios with limited memory and computation capacity. Knowledge distillation is one approach to derive compact models from bigger ones. However, it has been observed that a converged heavy teacher model imposes an overly strong constraint when training a compact student network and can drive the optimization toward poor local optima. In this paper, we propose ProKT, a new model-agnostic method that projects the supervision signals of a teacher model into the student's parameter space. Such projection is implemented by decomposing the training objective into local intermediate targets with an approximate mirror descent technique. The proposed method could be less sensitive to quirks during optimization, which could result in a better local optimum. Experiments on both image and text datasets show that our proposed ProKT consistently achieves superior performance compared to other existing knowledge distillation methods.
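The abstract does not spell out the exact form of the projection, but the idea of replacing the converged teacher's output with a local intermediate target can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration, not the authors' implementation: the names `intermediate_target`, `distill_step`, `eta`, and `alpha` are hypothetical, and the intermediate target is constructed as one mirror-descent step (negative-entropy mirror map, i.e. a geometric interpolation) from the student's current prediction toward the teacher's.

```python
import torch
import torch.nn.functional as F

def intermediate_target(student_logits, teacher_logits, eta=0.5):
    """One mirror-descent step on the probability simplex toward the teacher.

    With the negative-entropy mirror map this amounts to a geometric
    interpolation:  p_target ∝ p_student^(1 - eta) * p_teacher^eta,
    i.e. a convex combination of the two log-probability vectors.
    (Illustrative assumption; not the exact ProKT projection.)
    """
    log_ps = F.log_softmax(student_logits, dim=-1)
    log_pt = F.log_softmax(teacher_logits, dim=-1)
    mixed = (1.0 - eta) * log_ps + eta * log_pt
    return F.softmax(mixed, dim=-1)  # renormalize after interpolation

def distill_step(student, teacher, x, y, optimizer, eta=0.5, alpha=0.5):
    """One training step: fit the student to a nearby intermediate target
    instead of directly matching the converged (possibly far-away) teacher."""
    student.train()
    with torch.no_grad():
        teacher_logits = teacher(x)
    student_logits = student(x)
    target = intermediate_target(student_logits.detach(), teacher_logits, eta)
    kd_loss = F.kl_div(F.log_softmax(student_logits, dim=-1), target,
                       reduction="batchmean")
    ce_loss = F.cross_entropy(student_logits, y)
    loss = alpha * kd_loss + (1.0 - alpha) * ce_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the target is only a small step away from the student's current prediction, each update asks the student to match something it can plausibly represent, which is one way to read the claim that ProKT avoids the poor local optima induced by a fixed, converged teacher.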