This paper explores applying the wav2vec2 framework to speaker recognition instead of speech recognition. We study the effectiveness of the pre-trained weights on the speaker recognition task, and how to pool the wav2vec2 output sequence into a fixed-length speaker embedding. To adapt the framework to speaker recognition, we propose a single-utterance classification variant with cross-entropy (CE) or additive angular margin (AAM) softmax loss, and an utterance-pair classification variant with binary cross-entropy (BCE) loss. Our best-performing variant, w2v2-aam, achieves 1.88% EER on the extended VoxCeleb1 test set, compared to 1.69% EER for an ECAPA-TDNN baseline. Code is available at https://github.com/nikvaessen/w2v2-speaker.
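As a minimal illustration of the pooling step the abstract mentions, the sketch below mean-pools a (hypothetical) wav2vec2 output sequence of shape (T frames × D dimensions) into a single fixed-length, L2-normalized speaker embedding. The shapes and the random stand-in for the encoder output are assumptions for illustration, not values from the paper, and mean pooling is only one of the pooling strategies one could study.

```python
import numpy as np

# Stand-in for a wav2vec2 encoder output: T frames of D-dim features.
# T=49 roughly corresponds to ~1 s of audio at 50 frames/s; D=768 is the
# base-model hidden size. Both are illustrative assumptions.
rng = np.random.default_rng(0)
T, D = 49, 768
frames = rng.standard_normal((T, D))

# Mean pooling over time collapses the variable-length sequence into a
# fixed-length speaker embedding.
embedding = frames.mean(axis=0)

# L2-normalize so embeddings can be compared with cosine similarity.
embedding = embedding / np.linalg.norm(embedding)

print(embedding.shape)  # (768,) regardless of the input length T
```

Because the embedding length depends only on D, utterances of any duration map to the same fixed-size vector, which is what makes pairwise speaker comparison straightforward.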