State-of-the-art speaker verification systems are inherently dependent on human supervision, as they are trained on massive amounts of labeled data. However, manually annotating utterances is slow, expensive, and does not scale to the amount of data available today. In this study, we explore self-supervised learning for speaker verification by learning representations directly from raw audio. The objective is to produce robust speaker embeddings with small intra-speaker and large inter-speaker variance. Our approach builds on recent information-maximization learning frameworks and an intensive data-augmentation pre-processing step. We evaluate the ability of these methods to work without contrastive samples before showing that they achieve better performance when combined with a contrastive loss. Furthermore, our experiments show that our method reaches competitive results compared to existing techniques and outperforms a supervised baseline when fine-tuned with a small portion of labeled data.
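To make the combined objective concrete, here is a minimal sketch in PyTorch, assuming a VICReg-style information-maximization loss (variance, invariance, and covariance terms) plus an NT-Xent contrastive term. Neither loss nor the weighting is named in the abstract, so all specifics below are illustrative assumptions, not the authors' exact method.

```python
# Hedged sketch: an information-maximization objective (VICReg-style)
# combined with a contrastive NT-Xent loss on two augmented views
# of the same batch of utterances. Loss choices and weights are assumed.
import torch
import torch.nn.functional as F


def infomax_loss(za, zb, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """VICReg-style loss on two embedding views of shape (N, D)."""
    n, d = za.shape
    # Invariance: pull the two views of the same utterance together.
    sim = F.mse_loss(za, zb)
    # Variance: hinge keeping each dimension's std above 1 (avoids collapse).
    std_a = torch.sqrt(za.var(dim=0) + eps)
    std_b = torch.sqrt(zb.var(dim=0) + eps)
    var = F.relu(1.0 - std_a).mean() + F.relu(1.0 - std_b).mean()
    # Covariance: decorrelate embedding dimensions to reduce redundancy.
    za_c, zb_c = za - za.mean(dim=0), zb - zb.mean(dim=0)
    cov_a = (za_c.T @ za_c) / (n - 1)
    cov_b = (zb_c.T @ zb_c) / (n - 1)
    off_diag = lambda m: m.flatten()[:-1].view(d - 1, d + 1)[:, 1:].flatten()
    cov = off_diag(cov_a).pow(2).sum() / d + off_diag(cov_b).pow(2).sum() / d
    return sim_w * sim + var_w * var + cov_w * cov


def ntxent_loss(za, zb, temperature=0.1):
    """Contrastive loss: other utterances in the batch act as negatives."""
    z = F.normalize(torch.cat([za, zb]), dim=1)   # (2N, D)
    logits = z @ z.T / temperature
    logits.fill_diagonal_(float("-inf"))          # mask self-similarity
    n = za.shape[0]
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(logits, targets)


# Usage: za, zb are speaker embeddings of two augmented views.
za, zb = torch.randn(256, 512), torch.randn(256, 512)
loss = infomax_loss(za, zb) + 0.5 * ntxent_loss(za, zb)  # assumed weighting
```

The information-maximization terms alone work without contrastive samples; adding the NT-Xent term reintroduces explicit negatives, which is one plausible reading of "combined with a contrastive loss" in the abstract.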