Training robust speaker verification systems without speaker labels has long been a challenging task. Previous studies observed a large performance gap between self-supervised and fully supervised methods. In this paper, we apply a non-contrastive self-supervised learning framework called DIstillation with NO labels (DINO) and propose two regularization terms applied to the embeddings in DINO. One regularization term guarantees the diversity of the embeddings, while the other decorrelates the variables of each embedding. The effectiveness of various data augmentation techniques is explored in both the time and frequency domains. A range of experiments conducted on the VoxCeleb datasets demonstrates the superiority of the regularized DINO framework for speaker verification. Our method achieves state-of-the-art speaker verification performance under a single-stage self-supervised setting on VoxCeleb. The code will be made publicly available.
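To make the two regularizers concrete, below is a minimal sketch assuming VICReg-style variance and covariance penalties; the paper's exact formulation may differ, and the function names, tensor shapes, and loss weights are illustrative assumptions rather than the authors' implementation.

```python
# Sketch of the two embedding regularizers, assuming VICReg-style
# variance and covariance penalties (an assumption, not the paper's
# confirmed formulation). `emb` is a batch of DINO embeddings of
# shape (batch_size, dim).
import torch

def diversity_loss(emb: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    # Encourage each embedding dimension to keep its standard
    # deviation across the batch above 1, preventing collapse of
    # all utterances to identical embeddings.
    std = torch.sqrt(emb.var(dim=0) + eps)
    return torch.relu(1.0 - std).mean()

def decorrelation_loss(emb: torch.Tensor) -> torch.Tensor:
    # Penalize off-diagonal entries of the embedding covariance
    # matrix so the variables (dimensions) of each embedding are
    # decorrelated from one another.
    n, d = emb.shape
    emb = emb - emb.mean(dim=0)
    cov = (emb.T @ emb) / (n - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    return (off_diag ** 2).sum() / d

# Usage: both terms would be added to the DINO objective with
# tunable weights (the values here are placeholders).
emb = torch.randn(32, 256)  # dummy batch of speaker embeddings
total_reg = 1.0 * diversity_loss(emb) + 0.04 * decorrelation_loss(emb)
```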