In this paper, we propose self-supervised speaker representation learning strategies, which comprise bootstrap equilibrium speaker representation learning in the front-end and uncertainty-aware probabilistic speaker embedding training in the back-end. In the front-end stage, we learn speaker representations via a bootstrap training scheme with a uniformity regularization term. In the back-end stage, the probabilistic speaker embeddings are estimated by maximizing the mutual likelihood score between speech samples belonging to the same speaker, which provides not only speaker representations but also data uncertainty. Experimental results show that the proposed bootstrap equilibrium training strategy effectively helps learn speaker representations and outperforms conventional methods based on contrastive learning. Moreover, we demonstrate that the integrated two-stage framework further improves speaker verification performance on the VoxCeleb1 test set in terms of EER and MinDCF.
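To make the two quantities named above concrete, the following is a minimal sketch, not the authors' implementation: a hypersphere uniformity regularizer of the log-mean-Gaussian-potential form, and a mutual likelihood score between two diagonal-Gaussian embeddings (mean plus per-dimension variance, so the variance carries the data uncertainty). The function names, the temperature `t`, and the exact constant terms are illustrative assumptions, not taken from the paper.

```python
import math


def uniformity_loss(embeddings, t=2.0):
    # Uniformity regularizer over L2-normalized embeddings: the log of the
    # mean pairwise Gaussian potential exp(-t * ||z_i - z_j||^2). Lower
    # values mean the embeddings are spread more evenly on the hypersphere.
    # (The temperature t=2.0 is an illustrative default, not from the paper.)
    n = len(embeddings)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(embeddings[i], embeddings[j]))
            total += math.exp(-t * sq_dist)
            pairs += 1
    return math.log(total / pairs)


def mutual_likelihood_score(mu1, var1, mu2, var2):
    # Mutual likelihood score between two Gaussian embeddings
    # N(mu1, diag(var1)) and N(mu2, diag(var2)): the log-likelihood that
    # the two utterances share one latent speaker identity. Higher is
    # more likely same-speaker; large variances (high data uncertainty)
    # both flatten the distance term and add a log-variance penalty.
    score = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        s = v1 + v2
        score += -0.5 * ((m1 - m2) ** 2 / s + math.log(2 * math.pi * s))
    return score
```

In this formulation, training the back-end to maximize the mutual likelihood score over same-speaker pairs pushes the means of matching utterances together while letting the variances absorb low-quality or ambiguous inputs, which is how the embeddings expose data uncertainty alongside the speaker representation.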