Speech representations learned from large-scale unlabeled data have shown better generalizability than those from supervised learning, and have thus attracted much interest for application to various downstream tasks. In this paper, we explore the limits of speech representations learned with different self-supervised objectives and datasets for automatic speaker verification (ASV), using a well-recognized SOTA ASV model, ECAPA-TDNN [1], as the downstream model. The representations from all hidden layers of the pre-trained model are first averaged with learnable weights and then fed into ECAPA-TDNN as input features. The experimental results on the VoxCeleb dataset show that the weighted-average representation is significantly superior to FBank, a conventional handcrafted feature for ASV. Our best single system achieves 0.537%, 0.569%, and 1.180% equal error rate (EER) on the three official trials of VoxCeleb1, respectively. An ensemble system combining three pre-trained models further improves the EERs to 0.479%, 0.536%, and 1.023%. Among the three evaluation trials, our best system outperforms the winning system [2] of the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC2021) on the VoxCeleb1-E trial.
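To make the layer-weighting step concrete, below is a minimal PyTorch sketch of a learnable weighted average over the hidden-layer outputs of a pre-trained speech model. This is an illustration of the mechanism described above, not the authors' released code; the module name `WeightedLayerAverage` and the tensor shapes are assumptions for the example.

```python
import torch
import torch.nn as nn


class WeightedLayerAverage(nn.Module):
    """Sketch of a learnable weighted average over hidden-layer outputs.

    One scalar weight is learned per hidden layer of the pre-trained model;
    the weights are softmax-normalized and used to combine the layers into
    a single feature sequence (hypothetical module, for illustration only).
    """

    def __init__(self, num_layers: int):
        super().__init__()
        # One learnable scalar per hidden layer, initialized uniformly.
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, frames, dim)
        weights = torch.softmax(self.layer_weights, dim=0)
        # Weighted sum over the layer axis -> (batch, frames, dim),
        # which would then replace FBank as the downstream input features.
        return (weights.view(-1, 1, 1, 1) * hidden_states).sum(dim=0)


if __name__ == "__main__":
    # Dummy example: 13 layers (e.g., CNN output + 12 transformer layers),
    # batch of 2 utterances, 100 frames, 768-dim representations.
    avg = WeightedLayerAverage(num_layers=13)
    dummy = torch.randn(13, 2, 100, 768)
    features = avg(dummy)
    print(features.shape)  # torch.Size([2, 100, 768])
```

Under this sketch, the per-layer weights are trained jointly with the downstream ASV model, so the system can learn which hidden layers carry the most speaker-discriminative information.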