STC speaker recognition systems for the NIST SRE 2021

Anastasia Avdeeva, Aleksei Gusev, Igor Korsunov, Alexander Kozlov, Galina Lavrentyeva, Sergey Novoselov, Timur Pekhovsky, Andrey Shulipa, Alisa Vinogradova, Vladimir Volokhov, Evgeny Smirnov, Vasily Galyuk

This paper describes the STC Ltd. systems submitted to the NIST 2021 Speaker Recognition Evaluation for both the fixed and open training conditions. These systems consist of a number of diverse subsystems that use deep neural networks as feature extractors. During the NIST 2021 SRE challenge we focused on training state-of-the-art deep speaker embedding extractors, such as ResNets and ECAPA networks, with additive angular margin based loss functions. Additionally, inspired by the recent success of wav2vec 2.0 features in automatic speech recognition, we explored the effectiveness of this approach in the speaker verification field. According to our observations, fine-tuning the pretrained large wav2vec 2.0 model provides our best performing systems for the open track condition. Our experiments with wav2vec 2.0 based extractors for the fixed condition showed that unsupervised autoregressive pretraining with a Contrastive Predictive Coding loss opens the door to training powerful transformer-based extractors from raw speech signals. For the video modality we developed our best solution with the RetinaFace face detector and a deep ResNet face embedding extractor trained on large face image datasets. The final results for the primary systems were obtained from different configurations of score-level subsystem fusion followed by score calibration.
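For readers unfamiliar with the additive angular margin losses mentioned in the abstract, the following is a minimal NumPy sketch of the general idea: embeddings and class weights are length-normalised, a margin is added to the angle of the target class, and the scaled cosines are passed to a softmax cross-entropy. It is a generic illustration, not the STC implementation. The function name `aam_softmax_loss` and the default `margin` and `scale` values are assumptions chosen for the example.

```python
import numpy as np

def aam_softmax_loss(embeddings, class_weights, labels, margin=0.2, scale=30.0):
    """Mean additive-angular-margin softmax loss over a batch (illustrative).

    embeddings:    (batch, dim) float array
    class_weights: (num_classes, dim) float array
    labels:        (batch,) integer array of target class indices
    margin, scale: hypothetical defaults, not values taken from the paper
    """
    # L2-normalise both sides so that dot products are cosines of angles.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = np.clip(emb @ w.T, -1.0, 1.0)

    rows = np.arange(len(labels))
    # Add the margin to the target-class angle only. Practical implementations
    # also guard the case theta + margin > pi; that is omitted in this sketch.
    target_theta = np.arccos(cos[rows, labels])
    logits = cos.copy()
    logits[rows, labels] = np.cos(target_theta + margin)
    logits *= scale

    # Numerically stable log-softmax, then cross-entropy on the target class.
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[rows, labels].mean())
```

With `margin=0.0` this reduces to a plain softmax over scaled cosines. A positive margin lowers the target logit, so the network has to pull same-speaker embeddings closer in angle to reach the same loss.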


