The objective of this work is to develop a speaker recognition model that can be used in diverse scenarios. We hypothesise that two components must be adequately configured to build such a model. First, an adequate architecture is required. We explore several recent state-of-the-art models, including ECAPA-TDNN and MFA-Conformer, as well as other baselines. Second, a massive amount of data is required. We investigate several new training data configurations that combine a few existing datasets. The most extensive configuration includes over 87k speakers and 10.22k hours of speech. Four evaluation protocols are adopted to measure how the trained model performs in diverse scenarios. Through experiments, we find that MFA-Conformer, the model with the least inductive bias, generalises best. We also show that training with the proposed large data configurations yields better performance. A boost in generalisation is observed: the average performance across the four evaluation protocols improves by more than 20%. In addition, we demonstrate that these models' performance improves even further when model capacity is increased.