Knowledge distillation (KD), best known as an effective method for model compression, aims at transferring the knowledge of a bigger network (teacher) to a much smaller network (student). Conventional KD methods usually employ a teacher model trained in a supervised manner, where the output labels are treated only as targets. Extending this supervised scheme further, we introduce a new type of teacher model for connectionist temporal classification (CTC)-based sequence models, namely the Oracle Teacher, which leverages both the source inputs and the output labels as input to the teacher model. Since the Oracle Teacher learns a more accurate CTC alignment by referring to the target information, it can provide the student with better guidance. One potential risk of the proposed approach is a trivial solution in which the model's output simply copies the target input. Based on the many-to-one mapping property of the CTC algorithm, we present a training strategy that effectively prevents this trivial solution and thus enables the use of both source and target inputs for model training. Extensive experiments are conducted on two sequence learning tasks: speech recognition and scene text recognition. The experimental results show that the proposed model improves the students across these tasks while achieving a considerable speed-up in the teacher model's training time.
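To make the many-to-one mapping property referenced above concrete, here is a minimal sketch (not taken from the paper) of the standard CTC collapse function, often written as B: a frame-level alignment is reduced to a label sequence by merging consecutive repeated symbols and then removing blanks, so many distinct alignments map to the same target. The blank symbol and function name below are illustrative choices, not identifiers from the paper.

```python
# Sketch of the CTC many-to-one collapse mapping B (illustrative, not the paper's code):
# merge consecutive repeated symbols, then drop the blank token.
from itertools import groupby

BLANK = "-"  # illustrative blank symbol

def ctc_collapse(path):
    """Apply the CTC mapping B: merge consecutive repeats, then remove blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return [symbol for symbol in merged if symbol != BLANK]

# Many distinct frame-level alignments collapse to the same label sequence:
print(ctc_collapse(list("cc-aa-t")))   # ['c', 'a', 't']
print(ctc_collapse(list("c-a--tt-")))  # ['c', 'a', 't']
```

Because many alignments collapse to one target, a teacher constrained to emit valid CTC alignments cannot simply echo the target sequence frame by frame, which is the intuition behind preventing the trivial copying solution.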