Speech representations learned from large-scale unlabeled data have shown better generalizability than those from supervised learning, and have thus attracted much interest for application to various downstream tasks. In this paper, we explore the limits of speech representations learned with different self-supervised objectives and datasets for automatic speaker verification (ASV), using a well-recognized SOTA ASV model, ECAPA-TDNN [1], as the downstream model. The representations from all hidden layers of the pre-trained model are first averaged with learnable weights and then fed into ECAPA-TDNN as input features. The experimental results on the VoxCeleb dataset show that the weighted-average representation is significantly superior to FBank, a conventional handcrafted feature for ASV. Our best single system achieves 0.537%, 0.569%, and 1.180% equal error rate (EER) on the three official trials of VoxCeleb1, respectively. An ensemble system combining three pre-trained models further improves the EERs to 0.479%, 0.536%, and 1.023%. Among the three evaluation trials, our best system outperforms the winning system [2] of the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC2021) on the VoxCeleb1-E trial.
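To make the layer-weighting step concrete, below is a minimal PyTorch sketch of a learnable weighted average over the hidden-layer outputs of a pre-trained speech model. This is an illustration of the mechanism described above, not the authors' released code; the module name `WeightedLayerAverage` and the tensor shapes are assumptions for the example.

```python
import torch
import torch.nn as nn


class WeightedLayerAverage(nn.Module):
    """Sketch of a learnable weighted average over hidden-layer outputs.

    One scalar weight is learned per hidden layer of the pre-trained model;
    the weights are softmax-normalized and used to combine the layers into
    a single feature sequence (hypothetical module, for illustration only).
    """

    def __init__(self, num_layers: int):
        super().__init__()
        # One learnable scalar per hidden layer, initialized uniformly.
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, frames, dim)
        weights = torch.softmax(self.layer_weights, dim=0)
        # Weighted sum over the layer axis -> (batch, frames, dim),
        # which would then replace FBank as the downstream input features.
        return (weights.view(-1, 1, 1, 1) * hidden_states).sum(dim=0)


if __name__ == "__main__":
    # Dummy example: 13 layers (e.g., CNN output + 12 transformer layers),
    # batch of 2 utterances, 100 frames, 768-dim representations.
    avg = WeightedLayerAverage(num_layers=13)
    dummy = torch.randn(13, 2, 100, 768)
    features = avg(dummy)
    print(features.shape)  # torch.Size([2, 100, 768])
```

Under this sketch, the per-layer weights are trained jointly with the downstream ASV model, so the system can learn which hidden layers carry the most speaker-discriminative information.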