In this paper, we focus on improving the performance of text-dependent speaker verification systems in the scenario of limited training data. Deep learning based text-dependent speaker verification systems generally require a large-scale text-dependent training data set, which can be labor-intensive and costly to collect, especially for customized new wake-up words. Recent studies have proposed voice conversion systems that can generate high-quality synthesized speech for both seen and unseen speakers. Inspired by those works, we adopt two different voice conversion methods, as well as a very simple re-sampling approach, to generate new text-dependent speech samples for data augmentation purposes. Experimental results show that the proposed method significantly improves the Equal Error Rate from 6.51% to 4.51% in the scenario of limited training data.
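As an illustration of the simple re-sampling augmentation idea, the sketch below is a minimal assumed implementation (not the authors' code): resampling a waveform by a factor and then treating the result as if it were still at the original sampling rate jointly shifts pitch and tempo, so each perturbed copy behaves like the same wake-up phrase spoken by a new pseudo-speaker. The function name `resample_augment` and the perturbation factors 0.9/1.1 are illustrative assumptions; `librosa.resample` is just one possible way to do the resampling.

```python
# Hedged sketch of re-sampling based data augmentation for text-dependent
# speaker verification (assumed implementation, not the paper's exact method).
import numpy as np
import librosa


def resample_augment(wav: np.ndarray, sr: int = 16000, factor: float = 1.1) -> np.ndarray:
    """Perturb `wav` by resampling it to sr * factor.

    When the returned samples are later interpreted at the original rate `sr`,
    the speech is slowed/sped up and its pitch shifted, approximating an
    utterance of the same wake-up word from a different (pseudo) speaker.
    """
    return librosa.resample(wav, orig_sr=sr, target_sr=int(sr * factor))


# Usage: create two perturbed copies per training utterance.
# wav, sr = librosa.load("wakeword.wav", sr=16000)
# augmented = [resample_augment(wav, sr, f) for f in (0.9, 1.1)]
```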