In this paper, we propose a solution that allows speaker-conditioned speech models, such as VoiceFilter-Lite, to support an arbitrary number of enrolled users in a single pass. This is achieved by applying an attention mechanism to multiple speaker embeddings to compute a single attentive embedding, which is then used as a side input to the model. We implemented multi-user VoiceFilter-Lite and evaluated it on three tasks: (1) a streaming automatic speech recognition (ASR) task; (2) a text-independent speaker verification task; and (3) a personalized keyphrase detection task, where ASR has to detect keyphrases from multiple enrolled users in a noisy environment. Our experiments show that, with up to four enrolled users, multi-user VoiceFilter-Lite significantly reduces speech recognition and speaker verification errors when there is overlapping speech, without affecting performance under other acoustic conditions. This attentive speaker embedding approach can also be easily applied to other speaker-conditioned models such as personal VAD and personalized ASR.
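To illustrate how several enrolled-speaker embeddings can be reduced to one fixed-size side input, the following is a minimal sketch and not the paper's implementation. It assumes scaled dot-product scoring against a query vector (for example, one derived from the current input frame) followed by a softmax; the function name `attentive_embedding` and both parameters are hypothetical, and the scoring function actually used by multi-user VoiceFilter-Lite may differ.

```python
import math


def attentive_embedding(query, speaker_embeddings):
    """Combine several enrolled-speaker embeddings into one vector.

    Assumption: scaled dot-product scores against a query vector,
    normalized with a softmax. This is an illustrative choice only.
    """
    if not speaker_embeddings:
        raise ValueError("at least one enrolled speaker embedding is required")
    dim = len(query)

    # One relevance score per enrolled speaker.
    scores = [
        sum(q * e for q, e in zip(query, emb)) / math.sqrt(dim)
        for emb in speaker_embeddings
    ]

    # Numerically stable softmax over the enrolled speakers.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]

    # Weighted sum: the output size does not depend on the speaker count.
    combined = [
        sum(w * emb[i] for w, emb in zip(weights, speaker_embeddings))
        for i in range(dim)
    ]
    return combined, weights
```

Because the result always has the dimensionality of a single embedding, the downstream model can keep the same side-input shape whether one or several users are enrolled.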