Deep speaker models yield low error rates in speaker verification. However, this performance tends to come at the cost of model size and computation time, making these models difficult to run under resource-constrained conditions. We focus on small-footprint deep speaker embedding extraction using knowledge distillation. While prior work on this topic has addressed speaker embedding extraction at the utterance level, we propose combining embeddings from different levels of the x-vector model (the teacher network) to train small-footprint student networks. Results indicate that frame-level information is useful, with the student models being 85%-91% smaller than their teacher, depending on the size of the teacher embeddings. Concatenating teacher embeddings yields student networks that reach performance comparable to the teacher at a 75% relative size reduction. These findings carry over analogously to other x-vector variants.
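As a rough illustration of the idea, the following minimal sketch builds a distillation target by concatenating teacher embeddings from two levels and scores a student embedding against it. Everything specific here is an assumption made for illustration and is not taken from the paper: the helper names (`stats_pool`, `teacher_target`, `distillation_loss`), the toy dimensions, the use of mean and standard deviation pooling to summarize frame-level outputs, and mean squared error as the distillation loss.

```python
import random
from statistics import fmean, pstdev


def stats_pool(frames):
    """Summarize frame-level vectors (T x D) as per-dimension mean and std (2*D values)."""
    columns = list(zip(*frames))  # transpose to D columns of length T
    return [fmean(c) for c in columns] + [pstdev(c) for c in columns]


def teacher_target(frame_feats, utt_embedding):
    """Concatenate the pooled frame-level summary with the utterance-level embedding."""
    return stats_pool(frame_feats) + list(utt_embedding)


def distillation_loss(student_emb, target):
    """Mean squared error between the student output and the teacher target (assumed loss)."""
    if len(student_emb) != len(target):
        raise ValueError("student and target dimensions must match")
    return fmean((s - t) ** 2 for s, t in zip(student_emb, target))


# Hypothetical toy sizes: 50 frames of 4-dim teacher features, 6-dim utterance embedding.
rng = random.Random(0)
frame_feats = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(50)]
utt_embedding = [rng.gauss(0.0, 1.0) for _ in range(6)]

target = teacher_target(frame_feats, utt_embedding)  # length 2*4 + 6 = 14
student_emb = [t + 0.1 for t in target]               # stand-in for a student network output
loss = distillation_loss(student_emb, target)
```

In a real setup the student output would come from a trainable network, and the loss would be minimized with a gradient-based optimizer; the sketch only shows how a concatenated multi-level target could be assembled and compared.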