In this paper we present an efficient method for training speaker recognition models on small or under-resourced datasets. Our method requires far less data than state-of-the-art (SOTA) approaches such as the Angular Prototypical and GE2E loss functions, while achieving comparable results. It does so by leveraging knowledge of how a phoneme is reconstructed in the speaker's voice. For this purpose, we built a new dataset of 40 male speakers reading sentences in Portuguese, totaling approximately 3 hours of audio. We compare the three best architectures trained with our method and select the best one, a shallow architecture. We then compare this model against the SOTA approach for speaker recognition: Fast ResNet-34 trained on approximately 2,000 hours of data using the Angular Prototypical and GE2E loss functions. We carried out three experiments with datasets in different languages. Our model achieved the best result in one experiment and the second-best result in the other two. This highlights the value of our method, which proved to be a strong competitor to SOTA speaker recognition models while using 500x less data and a simpler approach.