Recent research has found that knowledge distillation can be effective in reducing the size of a network and in increasing generalization. A pre-trained, large teacher network, for example, was shown to be able to bootstrap a student model that eventually outperforms the teacher in a limited-label environment. Despite these advances, it is still relatively unclear \emph{why} this method works, that is, what the resulting student model does ``better''. To address this issue, we utilize two non-linear, low-dimensional embedding methods (t-SNE and IVIS) to visualize the representation spaces of different layers of a network. We perform a set of extensive experiments with different architecture parameters and distillation methods. The resulting visualizations and metrics clearly show that, compared to its non-distilled counterpart, the distilled network finds a more compact representation space that supports higher accuracy already in earlier layers.
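To make the visualization setup more concrete, the following is a minimal sketch, not the paper's actual pipeline, of how intermediate-layer representations can be captured and projected with t-SNE. The network architecture, the choice of hooked layer, and the dummy data are purely illustrative assumptions.

\begin{verbatim}
# Minimal sketch (illustrative only): capture an intermediate-layer
# representation from a small, hypothetical classifier and embed it
# in 2-D with t-SNE for visualization.
import torch
import torch.nn as nn
from sklearn.manifold import TSNE

# Hypothetical student network; layer names and sizes are assumptions.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 256), nn.ReLU(),   # "earlier" layer to inspect
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10),
)
model.eval()

# Capture the activations of the first hidden layer with a forward hook.
activations = []
def hook(_module, _inputs, output):
    activations.append(output.detach())

handle = model[2].register_forward_hook(hook)  # hook on the first ReLU

# Dummy batch standing in for real evaluation data (MNIST-sized inputs).
x = torch.randn(512, 1, 28, 28)
with torch.no_grad():
    model(x)
handle.remove()

# Project the captured representation space to 2-D for visualization.
features = torch.cat(activations).numpy()
embedding = TSNE(n_components=2, perplexity=30).fit_transform(features)
print(embedding.shape)  # (512, 2) -- ready to scatter-plot per class
\end{verbatim}

The same hook-and-embed pattern can be repeated at each layer (and with IVIS in place of t-SNE) to compare how class clusters tighten across depth in the distilled versus non-distilled network.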