We present a unified and hardware-efficient architecture for the two-stage voice trigger detection (VTD) and false trigger mitigation (FTM) tasks. Two-stage VTD systems of voice assistants can be falsely activated by audio segments that are acoustically similar to the trigger phrase of interest. FTM systems cancel such activations by using post-trigger audio context. Traditional FTM systems rely on automatic speech recognition lattices, which are computationally expensive to obtain on device. We propose a streaming transformer (TF) encoder architecture that progressively processes incoming audio chunks and maintains audio context to perform both the VTD and FTM tasks using only acoustic features. The proposed joint model yields an average 18% relative reduction in false reject rate (FRR) for the VTD task at a given false alarm rate. Moreover, our model suppresses 95% of the false triggers with an additional one second of post-trigger audio. Finally, on-device measurements show a 32% reduction in runtime memory and a 56% reduction in inference time compared to the non-streaming version of the model.
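To illustrate the kind of chunk-wise processing the abstract describes, the following minimal sketch (not the authors' implementation) shows a streaming Transformer encoder that consumes acoustic-feature chunks, keeps a bounded cache of previous frames as audio context, and feeds one shared representation to separate VTD and FTM heads. All names, hyperparameters, and the frame-cache policy are illustrative assumptions.

```python
# Minimal sketch of chunk-wise streaming Transformer encoding for joint
# VTD/FTM, assuming PyTorch. Sizes and the caching scheme are hypothetical.
import torch
import torch.nn as nn


class StreamingEncoder(nn.Module):
    def __init__(self, feat_dim=80, d_model=256, n_heads=4, n_layers=4,
                 max_context=64):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.max_context = max_context          # frames of history to keep
        self.vtd_head = nn.Linear(d_model, 2)   # trigger / non-trigger
        self.ftm_head = nn.Linear(d_model, 2)   # true trigger / false trigger

    @torch.no_grad()
    def stream(self, chunks):
        """Process audio chunk by chunk, carrying a bounded frame cache."""
        cache = None
        for chunk in chunks:                    # chunk: (1, T_chunk, feat_dim)
            x = self.proj(chunk)
            ctx = x if cache is None else torch.cat([cache, x], dim=1)
            ctx = ctx[:, -self.max_context:]    # truncate to the context window
            h = self.encoder(ctx)               # re-encode cached + new frames
            cache = ctx                         # reuse projected frames next time
            pooled = h.mean(dim=1)              # summary of current audio context
            yield self.vtd_head(pooled), self.ftm_head(pooled)


if __name__ == "__main__":
    enc = StreamingEncoder()
    # Ten 10-frame chunks of 80-dim filterbank features.
    audio = [torch.randn(1, 10, 80) for _ in range(10)]
    for vtd_logits, ftm_logits in enc.stream(audio):
        pass  # in practice, threshold VTD first, then run FTM on post-trigger audio
```

In this sketch the two heads share one encoder and one progressively updated context, which is the property that lets a single streaming model serve both tasks; the actual model in the paper may cache and attend to context differently.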