Customized keyword spotting (KWS) has great potential for deployment on edge devices to provide a hands-free user experience. In real applications, however, false alarms (FAs) become a serious problem when spotting dozens or even hundreds of keywords, and they drastically degrade the user experience. To address this problem, we leverage recent advances in transducer- and transformer-based acoustic models and propose a new multi-stage customized KWS framework named Cascaded Transducer-Transformer KWS (CaTT-KWS), which comprises a transducer-based keyword detector, a forced-alignment module based on a frame-level phone predictor, and a transformer-based decoder. Specifically, the streaming transducer module spots keyword candidates in the audio stream. Forced alignment is then performed on the phone posteriors produced by the phone predictor, which completes the first stage of keyword verification and refines the keyword's time boundaries. Finally, the transformer decoder further verifies the triggered keyword. The proposed CaTT-KWS framework reduces the FA rate effectively without noticeably hurting keyword recognition accuracy. On a challenging dataset, it achieves 0.13 FAs per hour, a relative FA reduction of over 90% compared with the transducer-based detection model, while keyword recognition accuracy drops by less than 2%.
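The cascade described above can be illustrated with a minimal control-flow sketch. It is not the authors' implementation: the three stages are passed in as hypothetical callables (`detect`, `align`, `verify`), and the assumption that each verification stage returns a score compared against a threshold (`align_threshold`, `verify_threshold`) is ours, since the abstract does not say how each stage makes its decision.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass
class Candidate:
    """A keyword hypothesis with time boundaries in seconds (illustrative type)."""
    keyword: str
    start: float
    end: float


# Hypothetical stage signatures; the real models are not specified here.
Detector = Callable[[Any], Optional[Candidate]]
Aligner = Callable[[Any, Candidate], Tuple[float, Candidate]]
Verifier = Callable[[Any, Candidate], float]


def cascade(
    audio: Any,
    detect: Detector,
    align: Aligner,
    verify: Verifier,
    align_threshold: float,
    verify_threshold: float,
) -> Optional[Candidate]:
    """Return the accepted keyword, or None if any stage rejects it."""
    # Stage 1: the streaming detector proposes a keyword candidate.
    candidate = detect(audio)
    if candidate is None:
        return None

    # Stage 2: forced alignment scores the candidate and refines its boundaries.
    # Assumption: a higher score means a better match.
    align_score, refined = align(audio, candidate)
    if align_score < align_threshold:
        return None

    # Stage 3: the decoder re-checks the refined segment.
    if verify(audio, refined) < verify_threshold:
        return None
    return refined


def demo() -> Optional[Candidate]:
    """Run the cascade with stub stages standing in for real models."""
    detect = lambda audio: Candidate("hello", 0.0, 1.2)
    align = lambda audio, c: (0.8, Candidate(c.keyword, 0.1, 1.0))
    verify = lambda audio, c: 0.9
    return cascade(None, detect, align, verify, 0.5, 0.5)
```

Because each stage runs only when the previous one accepts, the later and presumably more expensive checks are applied only to segments that the detector has already triggered on. This is how a cascade can lower the FA rate while leaving most true detections intact.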