In real application scenarios, it is often challenging to obtain large amounts of labeled data for speaker representation learning due to speaker privacy concerns. Self-supervised learning, which requires no labels, has become an increasingly promising way to address this problem. Compared with contrastive learning, self-distillation approaches use only positive samples in the loss function and are therefore more attractive. In this paper, we present a comprehensive study of self-distilled self-supervised speaker representation learning, with a particular focus on the critical role of data augmentation. Our proposed audio perturbation augmentation strategy pushes the performance of speaker representations to a new limit. Experimental results show that our model achieves a new state of the art (SoTA) on the VoxCeleb1 speaker verification benchmark (i.e., equal error rates (EER) of 2.505%, 2.473%, and 4.791% on the Vox1-O, Vox1-E, and Vox1-H trials, respectively) without using any speaker labels during training.
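For context on the reported metric: the equal error rate (EER) is the operating point of a verification system where the false-acceptance rate equals the false-rejection rate. Below is a minimal, illustrative sketch of how EER can be computed from trial scores; the function name and the simple threshold sweep are our own assumptions, not the authors' evaluation code (which typically uses finer interpolation).

```python
import numpy as np

def compute_eer(scores, labels):
    """Illustrative EER computation (assumed helper, not the paper's code).

    scores: higher means "more likely same speaker".
    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    Sweeps thresholds at the sorted scores and returns the rate where the
    false-acceptance rate (FAR) and false-rejection rate (FRR) are closest.
    """
    order = np.argsort(scores)[::-1]          # sort trials by descending score
    labels = np.asarray(labels)[order]
    n_pos = labels.sum()                      # number of target trials
    n_neg = len(labels) - n_pos               # number of non-target trials
    # Accepting the top-k trials: FAR = non-targets accepted / all non-targets,
    # FRR = targets rejected / all targets.
    far = np.cumsum(1 - labels) / n_neg
    frr = (n_pos - np.cumsum(labels)) / n_pos
    idx = np.argmin(np.abs(far - frr))        # point where FAR ≈ FRR
    return (far[idx] + frr[idx]) / 2
```

With perfectly separable scores the EER is 0; an EER of 2.505% on Vox1-O means that at the equal-error threshold, 2.505% of non-target trials are falsely accepted and 2.505% of target trials are falsely rejected.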