Very deep models for speaker recognition (SR) have demonstrated remarkable performance improvements in recent research. However, deploying these models for on-device applications with constrained computational resources is impractical. Light-weight models, on the other hand, are highly desirable in practice despite their sub-optimal performance. This research aims to improve light-weight SR models through large-scale label-free knowledge distillation (KD). Existing KD approaches for SR typically require speaker labels to learn task-specific knowledge, owing to the inefficiency of conventional distillation losses. To address this inefficiency and achieve label-free KD, we propose to employ the contrastive loss from self-supervised learning for distillation. Extensive experiments are conducted on a collection of public speech datasets from diverse sources. Results on light-weight SR models show that the proposed label-free KD with contrastive loss consistently outperforms both conventional distillation methods and self-supervised learning methods by a significant margin.
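To make the idea concrete, the following is a minimal PyTorch sketch of one common label-free contrastive distillation objective: an InfoNCE-style loss in which the teacher embedding of the same utterance serves as the positive and other utterances in the batch serve as negatives. The function name, temperature value, and model handles are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_distillation_loss(student_emb, teacher_emb, temperature=0.07):
    """InfoNCE-style distillation loss (illustrative sketch).

    For each utterance, the teacher embedding of the same utterance is the
    positive; teacher embeddings of the other utterances in the batch act as
    negatives. No speaker labels are required.
    """
    # L2-normalize so that dot products become cosine similarities.
    student = F.normalize(student_emb, dim=-1)            # (B, D)
    teacher = F.normalize(teacher_emb.detach(), dim=-1)   # (B, D), teacher frozen
    # Pairwise student-teacher similarities, scaled by temperature.
    logits = student @ teacher.t() / temperature          # (B, B)
    # The matching (diagonal) teacher embedding is the positive target.
    targets = torch.arange(student.size(0), device=student.device)
    return F.cross_entropy(logits, targets)

# Usage sketch (teacher_model, student_model, and wave_batch are placeholders):
# with torch.no_grad():
#     t_emb = teacher_model(wave_batch)
# s_emb = student_model(wave_batch)
# loss = contrastive_distillation_loss(s_emb, t_emb)
```

Because the objective only asks the student to identify which teacher embedding comes from the same utterance, it can be trained on large unlabeled speech collections, which is the setting the abstract describes.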