Streaming keyword spotting is a widely used solution for activating voice assistants. Methods that combine deep neural networks with hidden Markov models (DNN-HMM) have proven efficient and are widely adopted in this space, primarily because they can detect and identify the start and end of the wake-up word at low compute cost. However, such hybrid systems suffer from loss-metric mismatch when the DNN and HMM are trained independently. Sequence discriminative training cannot fully mitigate this mismatch because of the inherently Markovian style of the operation. We propose a low-footprint CNN model, called HEiMDaL, to detect and localize keywords in streaming conditions. We introduce an alignment-based classification loss to detect the occurrence of the keyword, along with an offset loss to predict the start of the keyword. For a given wake-word, HEiMDaL shows a 73% reduction in detection metrics along with equivalent localization accuracy, at the same memory footprint as existing DNN-HMM style models.
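To make the two-part training objective concrete, the following is a minimal sketch of how a per-frame classification loss and an offset loss could be combined. It is an illustration under assumptions, not the formulation used by HEiMDaL: the abstract does not specify the loss functions, so binary cross-entropy, an L1 offset penalty applied only on keyword frames, and the names `combined_loss` and `offset_weight` are all hypothetical choices.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one frame (assumed classification loss)."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def combined_loss(probs, pred_offsets, labels, true_offsets, offset_weight=1.0):
    """Hypothetical combination of detection and localization losses.

    probs:        per-frame keyword probabilities from the model
    labels:       0/1 per-frame targets, assumed to come from an alignment
    pred_offsets: predicted distance (in frames) back to the keyword start
    true_offsets: alignment-derived distance back to the keyword start
    """
    # Classification term: mean BCE over all frames.
    cls = sum(bce(p, y) for p, y in zip(probs, labels)) / len(probs)

    # Offset term: mean absolute error, only on frames labeled as keyword.
    pos = [i for i, y in enumerate(labels) if y == 1]
    if pos:
        off = sum(abs(pred_offsets[i] - true_offsets[i]) for i in pos) / len(pos)
    else:
        off = 0.0

    return cls + offset_weight * off, cls, off


total, cls, off = combined_loss(
    probs=[0.5, 0.5],
    pred_offsets=[0.0, 10.0],
    labels=[0, 1],
    true_offsets=[0.0, 12.0],
)
```

Restricting the offset term to keyword frames is one design choice among several; it keeps non-keyword frames, where a start offset is undefined, from contributing to the localization error.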