In real-world scenarios, speaker identification systems are tasked with identifying a speaker from a set of enrolled speakers given only a few samples per enrolled speaker. This paper demonstrates the effectiveness of meta-learning and relation networks for this use case. We propose improved relation networks for speaker verification and few-shot (unseen) speaker identification. The use of relation networks facilitates joint training of the frontend speaker encoder and the backend model. Inspired by the use of prototypical networks in speaker verification, and to increase the discriminability of the speaker embeddings, we train the model to classify samples in the current episode amongst all speakers present in the training set. Furthermore, we propose a new training regime for faster model convergence that extracts more information from a given meta-learning episode with negligible extra computation. We evaluate the proposed techniques on the VoxCeleb, SITW and VCTK datasets on the tasks of speaker verification and unseen speaker identification. The proposed approach consistently outperforms existing approaches on both tasks.
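As a rough illustration of the episodic setup described above, the following sketch samples one N-way K-shot episode and attaches two labels to each utterance: an episode-local label (for the few-shot objective) and a global label over all training speakers (for the additional classification objective). The function name `sample_episode`, its parameters, and the toy data are hypothetical and are not taken from the paper; the speaker encoder, the relation network, and the losses are deliberately omitted.

```python
import random


def sample_episode(utterances_by_speaker, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot episode (illustrative assumption, not the paper's code).

    Returns two lists, support and query, whose items are tuples of
    (utterance, episode_label, global_label).
    """
    all_speakers = sorted(utterances_by_speaker)
    # Fixed index per training speaker, usable as a global classification target.
    global_index = {spk: i for i, spk in enumerate(all_speakers)}

    # Pick the speakers that take part in this episode.
    episode_speakers = rng.sample(all_speakers, n_way)

    support, query = [], []
    for episode_label, spk in enumerate(episode_speakers):
        # Draw disjoint support and query utterances for this speaker.
        utts = rng.sample(utterances_by_speaker[spk], k_shot + n_query)
        for utt in utts[:k_shot]:
            support.append((utt, episode_label, global_index[spk]))
        for utt in utts[k_shot:]:
            query.append((utt, episode_label, global_index[spk]))
    return support, query


# Toy data: 10 hypothetical speakers with 6 utterance IDs each.
toy_data = {
    f"spk{i}": [f"spk{i}_utt{j}" for j in range(6)] for i in range(10)
}
support, query = sample_episode(
    toy_data, n_way=5, k_shot=2, n_query=3, rng=random.Random(0)
)
```

In a full training loop, the episode-local labels would drive the few-shot comparison between query and support embeddings, while the global labels would feed a classifier over every training speaker; how the two objectives are combined is not specified here.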