Personalization of on-device automatic speech recognition (ASR) has seen explosive growth in recent years, largely due to the increasing popularity of personal assistant features on mobile devices and smart home speakers. In this work, we present Personal VAD 2.0, a personalized voice activity detector that detects the voice activity of a target speaker as part of a streaming on-device ASR system. Although previous proof-of-concept studies have validated the effectiveness of Personal VAD, several critical challenges remain before the model can be used in production: first, the quality must be satisfactory in both enrollment and enrollment-less scenarios; second, it should operate in a streaming fashion; and finally, the model must be small enough to fit within a limited latency and CPU/memory budget. To meet these multi-faceted requirements, we propose a series of novel designs: 1) advanced speaker embedding modulation methods; 2) a new training paradigm that generalizes to enrollment-less conditions; 3) architecture and runtime optimizations for latency and resource restrictions. Extensive experiments on a realistic speech recognition system demonstrate the state-of-the-art performance of our proposed method.