Recent advances in unsupervised speech representation learning have produced new approaches and set new state-of-the-art results across diverse speech processing tasks. This paper investigates the use of wav2vec 2.0 deep speech representations for the speaker recognition task. The proposed procedure for fine-tuning wav2vec 2.0 with a simple TDNN and statistics pooling back-end, trained with an additive angular margin loss, yields a deep speaker embedding extractor that generalizes well across different domains. It is concluded that the Contrastive Predictive Coding pretraining scheme efficiently exploits the power of unlabeled data and thus opens the door to powerful transformer-based speaker recognition systems. The experimental results obtained in this study demonstrate that fine-tuning can be performed on relatively small and clean datasets. Applying data augmentation during fine-tuning provides additional performance gains in speaker verification. In this study, speaker recognition systems were analyzed on a wide range of well-known verification protocols: the VoxCeleb1 cleaned test set, the NIST SRE 18 development set, the NIST SRE 2016 and NIST SRE 2019 evaluation sets, the VOiCES evaluation set, and the NIST 2021 SRE and CTS challenge sets.
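For concreteness, the sketch below shows one way the described pipeline could be assembled in PyTorch: a pretrained wav2vec 2.0 encoder (via the Hugging Face `transformers` library), a small TDNN with statistics pooling as the back-end, and an additive angular margin (AAM) softmax loss. This is a minimal illustration, not the paper's exact configuration: the checkpoint name `facebook/wav2vec2-base`, the single-layer TDNN, the layer widths, and the margin/scale values `m=0.2`, `s=30.0` are all assumed for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import Wav2Vec2Model  # assumed checkpoint source


class StatsPooling(nn.Module):
    """Statistics pooling: concatenate per-utterance mean and std over time."""
    def forward(self, x):                     # x: (batch, time, feat)
        return torch.cat([x.mean(dim=1), x.std(dim=1)], dim=1)


class Wav2Vec2SpeakerEmbedder(nn.Module):
    """wav2vec 2.0 front-end + simple TDNN + stats pooling -> speaker embedding."""
    def __init__(self, emb_dim=256, hidden=512, ckpt="facebook/wav2vec2-base"):
        super().__init__()
        self.backbone = Wav2Vec2Model.from_pretrained(ckpt)
        feat = self.backbone.config.hidden_size
        # One TDNN layer (Conv1d over time) stands in for the paper's
        # "simple TDNN" back-end, whose exact depth is not given here.
        self.tdnn = nn.Sequential(
            nn.Conv1d(feat, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.BatchNorm1d(hidden),
        )
        self.pool = StatsPooling()
        self.embedding = nn.Linear(2 * hidden, emb_dim)

    def forward(self, wav):                   # wav: (batch, samples), 16 kHz
        h = self.backbone(wav).last_hidden_state         # (batch, time, feat)
        h = self.tdnn(h.transpose(1, 2)).transpose(1, 2)
        return self.embedding(self.pool(h))              # (batch, emb_dim)


def aam_softmax_loss(emb, weight, labels, s=30.0, m=0.2):
    """Additive angular margin softmax on L2-normalized embeddings.

    weight: (num_speakers, emb_dim) class-center matrix; s and m are the
    usual AAM scale and margin hyperparameters (values assumed here).
    """
    cos = F.linear(F.normalize(emb), F.normalize(weight))
    cos = cos.clamp(-1 + 1e-7, 1 - 1e-7)
    target = torch.cos(torch.acos(cos) + m)   # add margin m to the target angle
    onehot = F.one_hot(labels, weight.size(0)).float()
    logits = s * (onehot * target + (1 - onehot) * cos)
    return F.cross_entropy(logits, labels)
```

At verification time the classification head is discarded and the extracted embeddings are compared directly, typically with cosine similarity between the enrollment and test utterances.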