Speech2Phone: A Novel and Efficient Method for Training Speaker Recognition Models

Edresson Casanova, Arnaldo Candido Junior, Christopher Shulby, Frederico Santos de Oliveira, Lucas Rafael Stefanel Gris, Hamilton Pereira da Silva, Sandra Maria Aluisio, Moacir Antonelli Ponti

From arXiv; submitted to BRACIS

In this paper we present an efficient method for training speaker recognition models on small or under-resourced datasets. The method requires less data than other state-of-the-art (SOTA) methods, e.g. the Angular Prototypical and GE2E loss functions, while achieving similar results. It does so by exploiting knowledge gained from reconstructing a phoneme in the speaker's voice. For this purpose, we built a new dataset of 40 male speakers reading sentences in Portuguese, totaling approximately 3 hours. We compared the three best architectures trained with our method and selected the best one, which has a shallow architecture. We then compared this model with the SOTA approach for speaker recognition: Fast ResNet-34 trained on approximately 2,000 hours of speech with the Angular Prototypical and GE2E loss functions. Three experiments were carried out with datasets in different languages. Our model achieved the best result in one of the three experiments and the second-best result in the other two. This shows that our method is competitive with SOTA speaker recognition models while using 500x less data and a simpler approach.

Related content

Voiceprint Recognition

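Speaker recognition, also called voiceprint recognition (VPR), is a biometric technique that automatically determines a speaker's identity from the speaker-specific information contained in speech, using computers and modern information-recognition technology. The goal of speaker recognition research is to extract features from speech that characterize the speaker, and to build effective models and systems that identify speakers automatically and accurately.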
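
As a rough illustration of the identification step described above, the sketch below compares a probe embedding against enrolled speaker embeddings using cosine similarity, a common scoring choice in speaker recognition. It is not the method from the paper: the embedding values, the speaker names, and the `identify_speaker` helper are all hypothetical, and a real system would obtain the embeddings from a trained model rather than hard-coding them.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify_speaker(probe, enrolled):
    """Return (name, score) for the enrolled speaker most similar to the probe.

    Hypothetical helper: `enrolled` maps a speaker name to one embedding.
    """
    scores = {name: cosine_similarity(probe, emb) for name, emb in enrolled.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Made-up 4-dimensional embeddings, for illustration only.
# Real speaker embeddings typically have far more dimensions.
enrolled = {
    "speaker_a": [0.9, 0.1, 0.0, 0.2],
    "speaker_b": [0.1, 0.8, 0.3, 0.0],
}
probe = [0.8, 0.2, 0.1, 0.1]

name, score = identify_speaker(probe, enrolled)
print(f"closest enrolled speaker: {name} (cosine similarity {score:.3f})")
```

In a verification setting, the same score would instead be compared against a threshold to accept or reject a claimed identity; the threshold value depends on the model and data and is not specified here.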