Training speaker-discriminative and robust speaker verification systems without speaker labels remains challenging and worth exploring. In this study, we propose an effective self-supervised learning framework and a novel regularization strategy to facilitate self-supervised speaker representation learning. Unlike contrastive self-supervised learning methods, the proposed self-supervised regularization (SSReg) focuses exclusively on the similarity between the latent representations of positive data pairs. We also explore the effectiveness of alternative online data augmentation strategies in both the time and frequency domains. With our strong online data augmentation strategy, the proposed SSReg demonstrates the potential of self-supervised learning without negative pairs, and it significantly improves self-supervised speaker representation learning with a simple Siamese network architecture. Comprehensive experiments on the VoxCeleb datasets show that adding the effective self-supervised regularization yields a 23.4% relative improvement, and that our approach outperforms previous work.
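The abstract does not spell out the exact form of the SSReg objective, but a negative-pair-free loss over positive pairs in a Siamese setup is commonly realized as a symmetrized negative cosine similarity between one branch's predictor output and the other branch's (stop-gradient) embedding, in the style of SimSiam. The sketch below illustrates that assumed objective in plain Python; the function names and the symmetrized form are illustrative assumptions, not the paper's exact formulation.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def ssreg_loss(p1, z1, p2, z2):
    """Assumed SimSiam-style objective over one positive pair.

    p1, p2: predictor outputs for the two augmented views.
    z1, z2: encoder embeddings for the two views; in an autograd
    framework these would be detached (stop-gradient) so the loss
    cannot collapse by pulling the targets toward the predictions.
    Minimizing this pulls the two views' representations together
    without using any negative pairs.
    """
    return -0.5 * cosine_similarity(p1, z2) - 0.5 * cosine_similarity(p2, z1)

# Two identical views: similarity is maximal, loss reaches its minimum of -1.
print(ssreg_loss([1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]))  # -1.0
```

Because only positive-pair similarity appears in the objective, no large batch of negatives or memory bank is needed; the stop-gradient on the target branch is what prevents the trivial constant solution.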