Keyword spotting systems continuously process audio streams to detect keywords. One of the most challenging tasks in designing such systems is reducing false alarms (FAs), which occur when the system registers a keyword that was not uttered. In this paper, we propose a simple yet elegant solution to this problem that follows from the law of total probability. We show that existing deep keyword spotting mechanisms can be improved by Successive Refinement, in which the system first classifies whether the input audio is speech, then whether the input is keyword-like, and finally which keyword was uttered. We show that, across multiple models ranging in size from 13K to 2.41M parameters, the successive refinement technique reduces FA by up to a factor of 8 on in-domain held-out FA data and by up to a factor of 7 on out-of-domain (OOD) FA data. Further, our proposed approach is "plug-and-play" and can be applied to any deep keyword spotting model.
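To make the three-stage decision concrete, the following is a minimal sketch of how the stage probabilities could be combined at inference time. It assumes that a keyword can only occur in speech that is keyword-like, so that the law of total probability reduces to a single product: P(keyword k) = P(speech) · P(keyword-like | speech) · P(k | keyword-like). The function names `successive_refinement` and `detect`, the 0.5 threshold, and the example probabilities are hypothetical and chosen for illustration; the abstract does not specify how the models produce or combine these quantities.

```python
from typing import Optional, Sequence


def successive_refinement(
    p_speech: float,
    p_kw_like_given_speech: float,
    p_word_given_kw_like: Sequence[float],
) -> list[float]:
    """Combine three stage outputs into per-keyword probabilities.

    Assumption (illustrative, not taken from the paper): non-speech and
    non-keyword-like audio contribute zero probability to every keyword,
    so the total-probability sum collapses to one product per keyword.
    """
    # Probability that the audio passed the first two stages.
    gate = p_speech * p_kw_like_given_speech
    return [gate * p for p in p_word_given_kw_like]


def detect(
    p_speech: float,
    p_kw_like_given_speech: float,
    p_word_given_kw_like: Sequence[float],
    threshold: float = 0.5,  # hypothetical operating point
) -> Optional[int]:
    """Return the index of the detected keyword, or None if nothing fires."""
    scores = successive_refinement(
        p_speech, p_kw_like_given_speech, p_word_given_kw_like
    )
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] >= threshold else None


if __name__ == "__main__":
    # Clear speech that sounds like a keyword: the first keyword fires.
    print(detect(0.99, 0.95, [0.90, 0.05, 0.05]))
    # Background noise: the final classifier is confident, but the speech
    # gate suppresses the detection, avoiding a false alarm.
    print(detect(0.10, 0.90, [0.99, 0.01]))
```

In the second call the keyword classifier alone would have triggered, which is the kind of false alarm the earlier stages are meant to suppress under these assumptions.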
To Wake Up or Not: Reducing Keyword False Alarms by Successive Refinement