Knowledge distillation (KD) has recently emerged as an efficacious scheme for learning compact deep neural networks (DNNs). Despite the promising results achieved, the rationale that interprets the behavior of KD remains largely understudied. In this paper, we introduce a novel task-oriented attention model, termed KDExplainer, to shed light on the working mechanism underlying vanilla KD. At the heart of KDExplainer is a Hierarchical Mixture of Experts (HME), in which multi-class classification is reformulated as a multi-task binary one. By distilling knowledge from a free-form pre-trained DNN to KDExplainer, we observe that KD implicitly modulates the knowledge conflicts between different subtasks, and in fact has much more to offer than label smoothing. Based on these findings, we further introduce a portable tool, dubbed the virtual attention module (VAM), which can be seamlessly integrated with various DNNs to enhance their performance under KD. Experimental results demonstrate that, at negligible additional cost, student models equipped with VAM consistently outperform their non-VAM counterparts across different benchmarks. Furthermore, when combined with other KD methods, VAM remains effective at improving results, even though it is motivated only by vanilla KD.
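To make the HME reformulation concrete, below is a minimal sketch (not the authors' code) of how multi-class classification can be recast as multiple one-vs-rest binary subtasks whose features are mixed by a soft gate; all names (HMEHead, num_experts, etc.) and the single-level gating structure are illustrative assumptions rather than the paper's exact architecture.

```python
# Minimal sketch, assuming a PyTorch implementation; illustrative only.
import torch
import torch.nn as nn

class HMEHead(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int, num_experts: int = 4):
        super().__init__()
        # Each expert produces its own view of the backbone feature.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())
            for _ in range(num_experts)
        )
        # A gating network assigns per-sample attention weights over experts.
        self.gate = nn.Linear(feat_dim, num_experts)
        # One binary (one-vs-rest) classifier per class, i.e. a multi-task head.
        self.binary_heads = nn.Linear(feat_dim, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        gate_w = torch.softmax(self.gate(feats), dim=-1)                    # (B, E)
        expert_out = torch.stack([e(feats) for e in self.experts], dim=1)   # (B, E, D)
        mixed = (gate_w.unsqueeze(-1) * expert_out).sum(dim=1)              # (B, D)
        return self.binary_heads(mixed)                                     # (B, C) logits
```

In such a setup, each output logit would be trained with an independent binary target (e.g. BCEWithLogitsLoss), while the KD objective can still match the softened teacher probabilities over all classes, which is what allows the gating weights to expose how the teacher's dark knowledge arbitrates conflicts between subtasks.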