DNN-based models achieve high performance in the speaker verification (SV) task at the cost of substantial computation. Model size is a key concern when deploying models on resource-constrained devices, yet model compression for SV models has not been studied extensively in previous works. In this paper, weight quantization is applied to compress DNN-based speaker embedding extraction models. Uniform and powers-of-two quantization are used in the experiments. The results on VoxCeleb show that weight quantization can reduce the size of ECAPA-TDNN and ResNet by a factor of 4 with negligible performance degradation. The quantized 4-bit ResNet achieves performance similar to the original model while being 8 times smaller. We empirically show that ECAPA-TDNN is more sensitive to quantization than ResNet, owing to differences in their weight distributions. The experiments on CN-Celeb further demonstrate that the quantized models remain robust for SV under language mismatch.
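For readers unfamiliar with the two schemes named above, the sketch below illustrates symmetric uniform quantization and powers-of-two quantization of a weight tensor in NumPy. The clipping ranges, rounding rules, and function names are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of the two weight-quantization schemes mentioned in the abstract.
# The exact ranges and rounding rules are assumptions for illustration only.
import numpy as np

def uniform_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric uniform quantization to 2^bits integer levels over [-max|w|, max|w|]."""
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)          # step size per level
    q = np.clip(np.round(w / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return q * scale                                            # dequantized weights

def power_of_two_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Quantize each weight magnitude to the nearest power of two (plus a sign bit),
    so multiplications can be replaced by bit shifts."""
    sign = np.sign(w)
    mag = np.abs(w)
    max_exp = np.floor(np.log2(np.max(mag)))                    # largest exponent kept
    min_exp = max_exp - (2 ** (bits - 1) - 1)                   # smallest representable exponent
    exp = np.clip(np.round(np.log2(np.maximum(mag, 2.0 ** min_exp))), min_exp, max_exp)
    return sign * 2.0 ** exp

# Compare reconstruction error of the two schemes at 8 and 4 bits on random weights.
w = np.random.randn(1000).astype(np.float32)
for bits in (8, 4):
    for name, fn in (("uniform", uniform_quantize), ("pow2", power_of_two_quantize)):
        mse = np.mean((w - fn(w, bits)) ** 2)
        print(f"{name:>7s} {bits}-bit  MSE={mse:.6f}")
```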