In many speech-enabled human-machine interaction scenarios, user speech can overlap with the device playback audio. In these instances, the performance of tasks such as keyword spotting (KWS) and device-directed speech detection (DDD) can degrade significantly. To address this problem, we propose an implicit acoustic echo cancellation (iAEC) framework in which a neural network is trained to exploit the additional information from a reference microphone channel, learning to ignore the interfering signal and improve detection performance. We study this framework for the tasks of KWS and DDD on, respectively, an augmented version of Google Speech Commands v2 and a real-world Alexa device dataset. Notably, we show a 56% reduction in false-reject rate for the DDD task under device playback conditions. For the KWS task, we show comparable or superior performance over a strong end-to-end neural echo cancellation + KWS baseline, with an order of magnitude lower computational requirements.