In this paper, we describe SpeakerStew, a hybrid system to perform speaker verification on 46 languages. Two core ideas were explored in this system: (1) pooling training data of different languages together for multilingual generalization and reduced development cycles; (2) a triage mechanism between text-dependent and text-independent models to reduce runtime cost and expected latency. To the best of our knowledge, this is the first study of speaker verification systems at the scale of 46 languages. The problem is framed from the perspective of using a smart speaker device with interactions consisting of a wake-up keyword (text-dependent) followed by a speech query (text-independent). Experimental evidence suggests that training on multiple languages can generalize to unseen varieties while maintaining performance on seen varieties. We also found that it can reduce computational requirements for training models by an order of magnitude. Furthermore, during model inference on English data, we observe that leveraging a triage framework can reduce the number of calls to the more computationally expensive text-independent system by 73% (and reduce latency by 60%) while maintaining an EER no worse than the text-independent setup.
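To make the triage idea in (2) concrete, the sketch below shows one plausible decision flow: the cheap text-dependent model scores the wake-up keyword first, and the expensive text-independent model is invoked only when that score is ambiguous. This is a minimal illustration, not the paper's implementation; the `td_model`/`ti_model` objects, their `score` methods, and all threshold values are hypothetical.

```python
# Minimal sketch of a triage between a text-dependent (TD) and a
# text-independent (TI) speaker verification model.
# All names and thresholds below are illustrative assumptions.

ACCEPT_THRESHOLD = 0.8  # TD score above this -> accept without running TI
REJECT_THRESHOLD = 0.2  # TD score below this -> reject without running TI
TI_THRESHOLD = 0.5      # decision threshold for the TI model


def verify(wakeword_audio, query_audio, td_model, ti_model):
    """Return True if the speaker is accepted, False otherwise.

    The TD model scores only the short wake-up keyword. Confident accepts
    and rejects are resolved immediately, so the longer speech query never
    reaches the more expensive TI model; only ambiguous cases fall through.
    """
    td_score = td_model.score(wakeword_audio)

    if td_score >= ACCEPT_THRESHOLD:
        return True   # confident accept from the TD model alone
    if td_score <= REJECT_THRESHOLD:
        return False  # confident reject from the TD model alone

    # Ambiguous region: escalate to the TI model on the speech query.
    ti_score = ti_model.score(query_audio)
    return ti_score >= TI_THRESHOLD
```

Under this kind of scheme, widening the ambiguous region trades extra TI calls (compute and latency) for accuracy closer to the TI-only system, which is how a reported operating point such as 73% fewer TI calls at no loss in EER can be reached by tuning the two TD thresholds.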