Model architectures such as wav2vec 2.0 and HuBERT have been proposed to learn speech representations from audio waveforms in a self-supervised manner. When fine-tuned on downstream tasks such as speech recognition, these models have been shown to provide state-of-the-art performance. However, these models are large; even the smallest version has about 95 million parameters, which poses a challenge for deployment on edge AI devices. In this paper, we use knowledge distillation to reduce the original model size by about 75% while maintaining similar performance levels. Moreover, we distill both wav2vec 2.0 and HuBERT models and present a comprehensive performance analysis through experiments in which we fine-tune the distilled models in single-task and multi-task frameworks separately. In particular, our experiments show that fine-tuning the distilled models on keyword spotting and speaker verification tasks results in only 0.1% accuracy and 0.9% equal error rate degradations, respectively.
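To make the compression idea concrete, below is a minimal sketch of a representation-level knowledge distillation objective, assuming a frozen teacher (e.g., a pre-trained wav2vec 2.0 or HuBERT encoder) and a smaller student trained to match the teacher's frame-level hidden states with an L1 plus cosine-similarity loss. The function and variable names (`distillation_loss`, `teacher`, `student`) are illustrative assumptions, not the exact loss or implementation used in the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_hidden, teacher_hidden, alpha=1.0, beta=1.0):
    """Illustrative representation-level distillation loss.

    Combines an L1 term with a cosine-similarity term between the
    student's and teacher's frame-level hidden states, a common way to
    compress self-supervised speech encoders. Both tensors are assumed
    to have shape (batch, frames, dim).
    """
    l1 = F.l1_loss(student_hidden, teacher_hidden)
    cos = -F.logsigmoid(
        F.cosine_similarity(student_hidden, teacher_hidden, dim=-1)
    ).mean()
    return alpha * l1 + beta * cos

# Hypothetical training step: the teacher is frozen, only the student learns.
# with torch.no_grad():
#     t_hidden = teacher(waveform)   # hidden states from the large pre-trained model
# s_hidden = student(waveform)       # hidden states from the compact student
# loss = distillation_loss(s_hidden, t_hidden)
# loss.backward()
```

The student trained this way can then be fine-tuned on downstream tasks such as keyword spotting or speaker verification, either one task at a time or in a multi-task setup.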