Although few-shot learning has attracted much attention in the fields of image and audio classification, few efforts have been made on few-shot speaker identification. In few-shot learning, overfitting is a tough problem, mainly due to the mismatch between training and testing conditions. In this paper, we propose a few-shot speaker identification method that can alleviate the overfitting problem. In the proposed method, a depthwise separable convolutional network with channel attention is trained with a prototypical loss function. Experimental datasets are extracted from three public speech corpora: Aishell-2, VoxCeleb1, and TORGO. Experimental results show that the proposed method outperforms state-of-the-art methods for few-shot speaker identification in terms of accuracy and F-score.
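For concreteness, below is a minimal sketch of the two components the abstract names: a depthwise separable convolution block with channel attention (here realized in squeeze-and-excitation style) and a prototypical loss that classifies query embeddings by distance to class prototypes. The module name `DSConvSE`, the helper `prototypical_loss`, and all layer sizes are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only; layer sizes and names are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DSConvSE(nn.Module):
    """Depthwise separable convolution followed by channel attention (SE-style)."""
    def __init__(self, in_ch, out_ch, reduction=4):
        super().__init__()
        # Depthwise: one 3x3 filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
        # Pointwise: 1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        # Two-layer bottleneck that produces per-channel attention weights.
        self.fc1 = nn.Linear(out_ch, out_ch // reduction)
        self.fc2 = nn.Linear(out_ch // reduction, out_ch)

    def forward(self, x):
        x = F.relu(self.pointwise(self.depthwise(x)))
        w = x.mean(dim=(2, 3))                             # squeeze: global average pool
        w = torch.sigmoid(self.fc2(F.relu(self.fc1(w))))   # excitation: channel weights
        return x * w[:, :, None, None]                     # reweight channels

def prototypical_loss(support, support_labels, query, query_labels, n_way):
    """Prototypical loss: cross-entropy over negative squared distances to prototypes.

    support: (n_support, dim) embeddings; query: (n_query, dim) embeddings.
    """
    # Prototype = mean embedding of each class's support examples.
    protos = torch.stack([support[support_labels == c].mean(0) for c in range(n_way)])
    # Negative squared Euclidean distance to each prototype serves as the logit.
    logits = -torch.cdist(query, protos) ** 2
    return F.cross_entropy(logits, query_labels)
```

In an episodic training loop, each episode would sample `n_way` speakers with a few support and query utterances per speaker, embed them with a stack of such blocks, and backpropagate `prototypical_loss`; matching training episodes to the expected test-time condition is what mitigates the train/test mismatch the abstract refers to.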