Neural transducers have gained popularity in production ASR systems, achieving human-level recognition accuracy on standard benchmark datasets. However, their performance degrades significantly in the presence of crosstalk, especially when the background speech or noise is non-negligible relative to the primary speech (i.e., at low signal-to-noise ratios). Anchored speech recognition refers to a class of methods that use information from an anchor segment (e.g., a wake word) to recognize device-directed speech while ignoring interfering background speech and noise. In this paper, we investigate anchored speech recognition in the context of neural transducers. We use a tiny auxiliary network to extract context information from the anchor segment, and explore encoder biasing and joiner gating to guide the transducer towards the target speech. Moreover, to improve the robustness of context embedding extraction, we propose auxiliary training objectives to disentangle lexical content from speaking style. Our proposed methods are evaluated on synthetic LibriSpeech-based mixtures, where they improve word error rates by up to 36% compared to a background augmentation baseline.
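To make the two conditioning mechanisms named in the abstract more concrete, the following is a minimal NumPy sketch of how a context embedding from the anchor segment might bias the encoder output and gate the joiner. The abstract does not give the exact formulation, so everything here is an assumption for illustration: the mean-pool-and-project auxiliary network, the additive form of the bias, the sigmoid gate on the joiner's hidden units, and all function names, weight names, and dimensions are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def context_embedding(anchor_feats, W_ctx):
    """Hypothetical tiny auxiliary network: mean-pool the anchor frames
    (T_a, F) over time, then project to a (D,) context embedding."""
    return np.tanh(anchor_feats.mean(axis=0) @ W_ctx)


def bias_encoder(enc_out, ctx, W_bias):
    """Encoder biasing (assumed additive): project the context to the
    encoder dimension and broadcast it over all T frames."""
    return enc_out + ctx @ W_bias  # (T, H) + (H,)


def gated_joiner(enc_out, pred_out, ctx, W_gate, W_out):
    """Joiner gating (assumed multiplicative): a context-dependent gate
    in (0, 1) scales each hidden unit before the output projection."""
    hidden = np.tanh(enc_out[:, None, :] + pred_out[None, :, :])  # (T, U, H)
    gate = sigmoid(ctx @ W_gate)                                  # (H,)
    return (hidden * gate) @ W_out                                # (T, U, V)


# Toy dimensions, chosen arbitrarily for the illustration.
rng = np.random.default_rng(0)
F, D, H, V = 80, 16, 32, 10   # feature, context, hidden, vocabulary sizes
T_a, T, U = 20, 50, 7         # anchor frames, utterance frames, label steps

anchor = rng.standard_normal((T_a, F))
enc = rng.standard_normal((T, H))    # stand-in for encoder output
pred = rng.standard_normal((U, H))   # stand-in for prediction network output

ctx = context_embedding(anchor, rng.standard_normal((F, D)) * 0.1)
biased = bias_encoder(enc, ctx, rng.standard_normal((D, H)) * 0.1)
logits = gated_joiner(biased, pred, ctx,
                      rng.standard_normal((D, H)) * 0.1,
                      rng.standard_normal((H, V)) * 0.1)
print(logits.shape)
```

In this sketch the same context vector drives both mechanisms, but they act at different points: biasing shifts the acoustic representation before it reaches the joiner, while gating rescales the joint representation just before the output projection.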