This work investigates the compatibility between label smoothing (LS) and knowledge distillation (KD). Contemporary findings addressing this thesis statement take dichotomous standpoints: Müller et al. (2019) and Shen et al. (2021b). Critically, there has been no effort to understand and resolve these contradictory findings, leaving the primal question -- to smooth or not to smooth a teacher network? -- unanswered. The main contributions of our work are the discovery, analysis and validation of systematic diffusion as the missing concept that is instrumental in understanding and resolving these contradictory findings. This systematic diffusion essentially curtails the benefits of distilling from an LS-trained teacher, rendering KD at increased temperatures ineffective. Our discovery is comprehensively supported by large-scale experiments, analyses and case studies, including image classification, neural machine translation and compact student distillation tasks, spanning multiple datasets and teacher-student architectures. Based on our analysis, we suggest that practitioners use an LS-trained teacher with a low-temperature transfer to achieve high-performance students. Code and models are available at https://keshik6.github.io/revisiting-ls-kd-compatibility/
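For readers unfamiliar with the setup, below is a minimal sketch of the standard temperature-scaled KD objective and the recipe recommended above, assuming PyTorch (>= 1.10 for the `label_smoothing` argument); `kd_loss`, `T`, and `alpha` are illustrative names and defaults, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, T=1.0, alpha=0.5):
    """Standard KD objective: hard-label cross-entropy plus
    temperature-scaled KL divergence to the teacher's soft labels."""
    ce = F.cross_entropy(student_logits, targets)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # T^2 rescaling keeps gradient magnitudes comparable (Hinton et al., 2015)
    return alpha * ce + (1.0 - alpha) * kl

# The teacher would be trained with label smoothing, e.g.:
#   teacher_loss = F.cross_entropy(teacher_logits, targets, label_smoothing=0.1)
# and the student then distilled with a *low* transfer temperature, e.g. T=1.
student_logits = torch.randn(8, 10)   # dummy logits for illustration
teacher_logits = torch.randn(8, 10)
targets = torch.randint(0, 10, (8,))
loss = kd_loss(student_logits, teacher_logits, targets, T=1.0)
```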