One of the most challenging scenarios for smart speakers is the multi-talker case, in which target speech from the desired speaker is mixed with interfering speech from one or more other speakers. A smart assistant needs to determine which voice to recognize and which to ignore, and it must do so in a streaming, low-latency manner. This work presents two multi-microphone speech enhancement algorithms targeted at this scenario. Targeting on-device use cases, we assume that the algorithm has access to the signal preceding the hotword, which is referred to as the noise context. The first is the Context Aware Beamformer, which uses the noise context and the detected hotword to determine how to target the desired speaker. The second is an adaptive noise cancellation algorithm called Speech Cleaner, which trains a filter using the noise context. We demonstrate that the two algorithms are complementary in the signal-to-noise ratio (SNR) conditions under which they work well. We also propose an algorithm to select which one to use based on the estimated SNR. When using 3 microphone channels, the final system achieves a relative word error rate reduction of 55% at -12 dB and 43% at 12 dB.
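The idea of training a cancellation filter on the noise context can be illustrated with a minimal sketch. This is not the Speech Cleaner implementation, whose details the abstract does not give. It assumes a time-domain least-squares FIR filter that predicts the noise in microphone 0 from the other microphones, fitted only on audio captured before the hotword. The function names, the `taps` parameter, and the choice of microphone 0 as the primary channel are all hypothetical.

```python
import numpy as np

def _stack_taps(ref, taps):
    """Build a regression matrix of delayed samples from the reference mics.

    ref: array of shape (num_ref_mics, num_samples).
    Returns an array of shape (num_samples - taps + 1, num_ref_mics * taps).
    """
    n = ref.shape[1] - taps + 1
    cols = [ch[k:k + n] for ch in ref for k in range(taps)]
    return np.stack(cols, axis=1)

def fit_cancellation_filter(context, taps=8):
    """Fit filter weights on the noise context (assumed to hold no target speech).

    context: array of shape (num_mics, num_samples), audio before the hotword.
    The filter predicts mic 0 from the remaining mics (an assumption of this sketch).
    """
    X = _stack_taps(context[1:], taps)
    y = context[0, taps - 1:]
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def apply_cancellation(audio, w, taps=8):
    """Subtract the predicted noise from mic 0 for the audio that follows."""
    X = _stack_taps(audio[1:], taps)
    return audio[0, taps - 1:] - X @ w

# Toy usage: one interfering source reaching 3 mics with different gains and delays.
rng = np.random.default_rng(0)
source = rng.standard_normal(4000)
mics = np.stack([np.roll(source, 2), 0.8 * source, 0.5 * np.roll(source, 1)])
weights = fit_cancellation_filter(mics[:, :2000])      # noise context
cleaned = apply_cancellation(mics[:, 2000:], weights)  # audio after the hotword
```

Because the filter is fitted while only the interferer is active, it models the interferer's path between microphones. A practical system would more likely operate per frequency band and in a streaming fashion, which this sketch does not attempt.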