Knowledge distillation transfers knowledge from a cumbersome teacher network to a small student. Recent results suggest that a student-friendly teacher is more appropriate for distillation, since it provides more transferable knowledge. In this work, we propose a novel framework, "prune, then distill," which first prunes the teacher model to make its knowledge more transferable and then distills it to the student. We provide several exploratory examples in which the pruned teacher teaches better than the original unpruned network. We further show theoretically that the pruned teacher acts as a regularizer in distillation, reducing the generalization error. Based on this result, we propose a novel neural network compression scheme in which the student network is formed from the pruned teacher and the "prune, then distill" strategy is then applied. The code is available at https://github.com/ososos888/prune-then-distill
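The sketch below illustrates the two-stage pipeline described above, assuming a PyTorch setting: first prune the teacher, then train the student against the pruned teacher with a standard knowledge-distillation loss. The helper names, pruning amount, temperature, and mixing weight are illustrative placeholders, not the paper's exact configuration (see the repository linked above for the authors' implementation).

```python
# Minimal sketch of "prune, then distill" (illustrative, not the official code).
import torch
import torch.nn.functional as F
import torch.nn.utils.prune as prune

def prune_teacher(teacher, amount=0.5):
    """Globally prune the smallest-magnitude weights of conv/linear layers."""
    params = [(m, "weight") for m in teacher.modules()
              if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear))]
    prune.global_unstructured(params, pruning_method=prune.L1Unstructured, amount=amount)
    # In practice the pruned teacher is fine-tuned here to recover accuracy.
    return teacher

def distill_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.9):
    """Standard KD loss: softened KL term plus hard-label cross-entropy."""
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, targets)
    return alpha * kd + (1 - alpha) * ce

def distill(student, teacher, loader, epochs=1, lr=1e-2):
    """Train the student to match the (pruned) teacher's softened outputs."""
    teacher.eval()
    opt = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                t_logits = teacher(x)
            loss = distill_loss(student(x), t_logits, y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student

# Usage: student = distill(student, prune_teacher(teacher), train_loader)
```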