The knowledge distillation field delicately designs various types of knowledge to shrink the performance gap between a compact student and a large-scale teacher. Existing distillation approaches focus solely on improving \textit{knowledge quality}, but ignore the significant influence of \textit{knowledge quantity} on the distillation procedure. In contrast to conventional distillation approaches, which extract knowledge from a fixed teacher computation graph, this paper explores a non-negligible research direction from the novel perspective of \textit{knowledge quantity} to further improve the efficacy of knowledge distillation. We introduce the new concept of knowledge decomposition, and further put forward the \textbf{P}artial to \textbf{W}hole \textbf{K}nowledge \textbf{D}istillation~(\textbf{PWKD}) paradigm. Specifically, we reconstruct the teacher into weight-sharing sub-networks with the same depth but increasing channel width, and train these sub-networks jointly to obtain decomposed knowledge~(sub-networks with more channels represent more knowledge). The student then extracts partial-to-whole knowledge from the pre-trained teacher over multiple training stages, in which a cyclic learning rate is leveraged to accelerate convergence. In general, \textbf{PWKD} can be regarded as a plugin compatible with existing offline knowledge distillation approaches. To verify the effectiveness of \textbf{PWKD}, we conduct experiments on two benchmark datasets, CIFAR-100 and ImageNet, and comprehensive evaluation results reveal that \textbf{PWKD} consistently improves existing knowledge distillation approaches without bells and whistles.
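The multi-stage training procedure described above can be summarized in the minimal PyTorch sketch below. It assumes a slimmable-style teacher whose forward pass accepts a width ratio selecting one of the weight-sharing sub-networks; identifiers such as \texttt{width\_ratios}, the \texttt{width\_ratio} argument, and \texttt{kd\_loss} are illustrative placeholders under that assumption, not the authors' released implementation.
\begin{verbatim}
# Illustrative PWKD-style training loop (hypothetical interfaces).
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Vanilla KD objective: soft-target KL divergence plus hard-label CE."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

def train_pwkd(student, teacher, loader,
               width_ratios=(0.25, 0.5, 0.75, 1.0),
               epochs_per_stage=60, max_lr=0.1, device="cuda"):
    """One distillation stage per teacher sub-network (narrow to full width),
    restarting a cyclic learning rate schedule at every stage."""
    teacher.eval()
    for ratio in width_ratios:  # partial -> whole knowledge
        optimizer = torch.optim.SGD(student.parameters(), lr=max_lr,
                                    momentum=0.9, weight_decay=5e-4)
        scheduler = torch.optim.lr_scheduler.CyclicLR(
            optimizer, base_lr=1e-4, max_lr=max_lr,
            step_size_up=max(1, len(loader) * epochs_per_stage // 2))
        for _ in range(epochs_per_stage):
            for images, labels in loader:
                images, labels = images.to(device), labels.to(device)
                with torch.no_grad():
                    # assumed sub-network selection by channel-width ratio
                    t_logits = teacher(images, width_ratio=ratio)
                s_logits = student(images)
                loss = kd_loss(s_logits, t_logits, labels)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                scheduler.step()
    return student
\end{verbatim}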