In many speech-enabled human-machine interaction scenarios, user speech can overlap with the device playback audio. In these instances, the performance of tasks such as keyword spotting (KWS) and device-directed speech detection (DDD) can degrade significantly. To address this problem, we propose an implicit acoustic echo cancellation (iAEC) framework in which a neural network is trained to exploit the additional information from a reference microphone channel, learning to ignore the interfering signal and improve detection performance. We study this framework for the tasks of KWS and DDD on, respectively, an augmented version of Google Speech Commands v2 and a real-world Alexa device dataset. Notably, we show a $56\%$ reduction in false-reject rate for the DDD task under device playback conditions. For the KWS task, we also show comparable or superior performance over a strong end-to-end neural echo cancellation + KWS baseline, at an order of magnitude lower computational cost.