Wav2vec 2.0 is a recently proposed self-supervised framework for speech representation learning. It follows a two-stage training process of pre-training and fine-tuning, and performs well in speech recognition tasks, especially in ultra-low-resource cases. In this work, we extend the self-supervised framework to speaker verification and language identification. First, preliminary experiments indicate that wav2vec 2.0 can capture information about both the speaker and the language. We then demonstrate the effectiveness of wav2vec 2.0 on each of the two tasks. For speaker verification, we obtain a new state-of-the-art result, an Equal Error Rate (EER) of 3.61% on the VoxCeleb1 dataset. For language identification, we obtain an EER of 12.02% under the 1-second condition and an EER of 3.47% under the full-length condition of the AP17-OLR dataset. Finally, we achieve unified modeling of the two tasks with a single model trained via multi-task learning.
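All results above are reported as Equal Error Rate (EER), the operating point at which the false-accept rate equals the false-reject rate. As a reference for readers, a minimal sketch of computing EER from a list of verification trial scores is given below; the function name and the use of NumPy are illustrative choices, not taken from the paper.

```python
import numpy as np

def compute_eer(scores, labels):
    """Equal Error Rate: the point where false-accept rate == false-reject rate.

    scores: similarity score per trial (higher = more likely a target trial)
    labels: 1 for target (same speaker/language) trials, 0 for non-target trials
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)

    # Sort trials by score, descending; sweeping the threshold down this
    # ordering accepts one more trial at each step.
    order = np.argsort(scores)[::-1]
    labels = labels[order]

    n_pos = labels.sum()          # number of target trials
    n_neg = len(labels) - n_pos   # number of non-target trials

    tp = np.cumsum(labels)        # true accepts at each threshold
    fp = np.cumsum(1 - labels)    # false accepts at each threshold

    frr = 1.0 - tp / n_pos        # false-reject rate
    far = fp / n_neg              # false-accept rate

    # EER is where the two error curves cross; on a finite trial list we take
    # the midpoint at the closest crossing.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0
```

For a perfectly separable trial list (all target scores above all non-target scores), `compute_eer` returns 0.0; scores such as those in the abstract (e.g. 3.61% on VoxCeleb1) correspond to partially overlapping score distributions.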