Neural transducers have achieved human-level performance on standard speech recognition benchmarks. However, their performance degrades significantly in the presence of cross-talk, especially when the primary speaker has a low signal-to-noise ratio (SNR). Anchored speech recognition refers to a class of methods that use information from an anchor segment (e.g., wake-words) to recognize device-directed speech while ignoring interfering background speech. In this paper, we investigate anchored speech recognition to make neural transducers robust to background speech. We extract context information from the anchor segment with a tiny auxiliary network, and use encoder biasing and joiner gating to guide the transducer toward the target speech. Moreover, to improve the robustness of context embedding extraction, we propose auxiliary training objectives to disentangle lexical content from speaking style. We evaluate our methods on synthetic LibriSpeech-based mixtures spanning several SNR and overlap conditions; averaged over all conditions, they reduce word error rate by 19.6% relative to a strong baseline.
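To make the mechanism concrete, the sketch below illustrates one plausible reading of the approach in PyTorch: a small auxiliary network pools the anchor frames into a context embedding, which then (a) biases the encoder outputs additively and (b) gates the joiner activations multiplicatively. This is a minimal sketch under stated assumptions, not the paper's implementation; the module names (`AnchorContextNet`, `BiasedGatedTransducerHead`), the mean-pooling choice, and all dimensions are hypothetical, and the transducer's prediction network is omitted for brevity.

```python
import torch
import torch.nn as nn


class AnchorContextNet(nn.Module):
    """Tiny auxiliary network (hypothetical): mean-pools anchor
    (wake-word) frames into a fixed-size context embedding."""

    def __init__(self, feat_dim: int, ctx_dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(feat_dim, ctx_dim), nn.Tanh())

    def forward(self, anchor_feats: torch.Tensor) -> torch.Tensor:
        # anchor_feats: (batch, anchor_frames, feat_dim)
        return self.proj(anchor_feats.mean(dim=1))  # (batch, ctx_dim)


class BiasedGatedTransducerHead(nn.Module):
    """Encoder biasing: add a projected context vector to every encoder frame.
    Joiner gating: scale joiner activations by a context-conditioned gate."""

    def __init__(self, enc_dim: int, ctx_dim: int, joint_dim: int, vocab: int):
        super().__init__()
        self.bias_proj = nn.Linear(ctx_dim, enc_dim)
        self.gate_proj = nn.Linear(ctx_dim, joint_dim)
        # Stand-in for the full encoder/prediction-network fusion of a transducer.
        self.joint = nn.Linear(enc_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab)

    def forward(self, enc_out: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        # enc_out: (batch, T, enc_dim); ctx: (batch, ctx_dim)
        biased = enc_out + self.bias_proj(ctx).unsqueeze(1)     # encoder biasing
        gate = torch.sigmoid(self.gate_proj(ctx)).unsqueeze(1)  # gate in (0, 1)
        joint = torch.tanh(self.joint(biased)) * gate           # joiner gating
        return self.out(joint)  # per-frame logits over the vocabulary
```

In this reading, the gate lets the context embedding suppress joiner activations on frames dominated by background speech, while the additive bias steers the encoder representation toward the anchor speaker; both are lightweight additions on top of a standard transducer.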