Recently, self-supervised learning (SSL) has demonstrated strong performance in speaker recognition, even when the pre-training objective is designed for speech recognition. In this paper, we study which factors lead to the success of self-supervised learning on speaker-related tasks, e.g., speaker verification (SV), through a series of carefully designed experiments. Our empirical results on the Voxceleb-1 dataset suggest that the benefit of SSL to the SV task comes from a combination of the masked speech prediction loss, data scale, and model size, while the SSL quantizer has only a minor impact. We further employ the integrated gradients attribution method and loss landscape visualization to understand the effectiveness of self-supervised learning for speaker recognition.
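As a rough illustration of the integrated gradients attribution method mentioned above, the sketch below approximates the path integral of gradients between a baseline and an input with a Riemann sum. The toy model, the 80-dimensional frame feature, and the all-zero baseline are hypothetical stand-ins for illustration only, not the paper's actual architecture or setup.

```python
# Minimal sketch of integrated gradients, assuming a PyTorch model that maps
# a feature vector to a scalar score. Shapes and the model are illustrative.
import torch
import torch.nn as nn


def integrated_gradients(model, x, baseline=None, steps=50):
    """Approximate integrated gradients of the model's scalar output w.r.t. x."""
    if baseline is None:
        baseline = torch.zeros_like(x)  # common choice: all-zero ("silence") baseline
    # Interpolate between the baseline and the input in `steps` increments.
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
    interpolated = baseline + alphas * (x - baseline)  # shape: (steps, *x.shape)
    interpolated.requires_grad_(True)
    outputs = model(interpolated).sum()  # scalar, so gradients are per interpolated input
    grads = torch.autograd.grad(outputs, interpolated)[0]
    avg_grads = grads.mean(dim=0)  # Riemann approximation of the path integral
    return (x - baseline) * avg_grads  # per-feature attribution


if __name__ == "__main__":
    # Hypothetical toy "speaker model": an 80-dim frame-level feature -> scalar score.
    model = nn.Sequential(nn.Linear(80, 64), nn.ReLU(), nn.Linear(64, 1))
    x = torch.randn(80)
    attributions = integrated_gradients(model, x)
    print(attributions.shape)  # torch.Size([80])
```

In practice such attributions would be computed over the model's input features (e.g., waveform samples or acoustic frames) and then inspected to see which regions drive the speaker decision.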