Revisiting Knowledge Distillation: The Hidden Role of Dataset Size

Knowledge distillation (KD), the training of a student model from a teacher model, is a widely adopted technique in deep learning. However, it is still not clear how and why distillation works. Previous studies focus on two central aspects of distillation: model size and generalisation. In this work we study distillation along a third dimension: dataset size. We present a suite of experiments across a wide range of datasets, tasks and neural architectures, demonstrating that the effect of distillation is not only preserved but amplified in low-data regimes. We call this newly discovered property the data efficiency of distillation. Equipped with this new perspective, we test the predictive power of existing theories of KD as we vary the dataset size. Our results disprove the hypothesis that distillation can be understood as label smoothing, and provide further evidence in support of the dark knowledge hypothesis. Finally, we analyse the impact of modelling factors such as the objective, scale and relative number of samples on the observed phenomenon. Ultimately, this work reveals that dataset size may be a fundamental but overlooked variable in the mechanisms underpinning distillation.
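
To make the contrast between distillation and label smoothing concrete, the following is a minimal sketch of the two kinds of training target. It assumes the commonly used temperature-scaled soft-target formulation of KD; the abstract does not specify the exact objective used in the experiments, so the function names and the `alpha`, `temperature` and `eps` parameters are illustrative assumptions rather than the authors' setup.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, probs):
    """Cross-entropy H(target, probs) between two distributions."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


def label_smoothing_target(label, num_classes, eps=0.1):
    """One-hot target mixed with a uniform distribution (assumed eps)."""
    return [
        (1 - eps) * (1.0 if k == label else 0.0) + eps / num_classes
        for k in range(num_classes)
    ]


def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=2.0):
    """Assumed KD objective: alpha * CE(hard label)
    + (1 - alpha) * T^2 * CE(teacher soft targets, student)."""
    n = len(student_logits)
    hard = [1.0 if k == label else 0.0 for k in range(n)]
    ce_hard = cross_entropy(hard, softmax(student_logits))
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    ce_soft = cross_entropy(soft_teacher, soft_student)
    return alpha * ce_hard + (1 - alpha) * temperature ** 2 * ce_soft


# Hypothetical logits for a three-class problem with true class 0.
teacher_logits = [5.0, 2.0, 0.1]
student_logits = [1.0, 0.5, 0.2]

smoothed = label_smoothing_target(0, 3)
teacher_soft = softmax(teacher_logits, temperature=2.0)
loss = distillation_loss(student_logits, teacher_logits, label=0)
```

In this sketch the label-smoothing target assigns the same mass to every incorrect class, whereas the teacher's soft target ranks the incorrect classes differently. That class-dependent structure is what the dark knowledge hypothesis refers to, and it is absent from a smoothed label.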
