We present a frontend for improving the robustness of automatic speech recognition (ASR) that jointly implements three modules within a single model: acoustic echo cancellation, speech enhancement, and speech separation. This is achieved by using a contextual enhancement neural network that can optionally make use of different types of side inputs: (1) a reference signal of the playback audio, which is necessary for echo cancellation; (2) a noise context, which is useful for speech enhancement; and (3) an embedding vector representing the voice characteristics of the target speaker of interest, which is not only critical for speech separation, but also helpful for echo cancellation and speech enhancement. We present detailed evaluations showing that the joint model performs almost as well as the task-specific models, and significantly reduces the word error rate in noisy conditions even when using a large-scale state-of-the-art ASR model. Compared to the noisy baseline, the joint model reduces the word error rate in low signal-to-noise ratio conditions by at least 71% on our echo cancellation dataset, 10% on our noisy dataset, and 26% on our multi-speaker dataset. Compared to the task-specific models, the joint model performs within 10% on the echo cancellation dataset, 2% on the noisy dataset, and 3% on the multi-speaker dataset.
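To make the conditioning interface concrete, the sketch below shows one way a single enhancement network can optionally consume the three side inputs named above. This is a minimal illustrative sketch in PyTorch, not the architecture evaluated in this work: the class name, layer choices, dimensions, and the additive fusion of side inputs are all assumptions made for exposition.

```python
import torch
import torch.nn as nn

class ContextualFrontend(nn.Module):
    """Hypothetical joint AEC / enhancement / separation frontend.

    One enhancement network with three optional side inputs:
    a playback reference (echo cancellation), a noise context
    (enhancement), and a target-speaker embedding (separation).
    All names and sizes here are illustrative assumptions.
    """

    def __init__(self, feat_dim=128, emb_dim=256, hidden=512):
        super().__init__()
        self.mix_proj = nn.Linear(feat_dim, hidden)
        self.ref_proj = nn.Linear(feat_dim, hidden)    # playback-reference features
        self.noise_proj = nn.Linear(feat_dim, hidden)  # noise-context features
        self.spk_proj = nn.Linear(emb_dim, hidden)     # target-speaker embedding
        self.encoder = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.mask_head = nn.Linear(hidden, feat_dim)

    def forward(self, mix_feats, ref_feats=None, noise_feats=None, spk_emb=None):
        # mix_feats: (batch, time, feat_dim) features of the noisy mixture
        h = self.mix_proj(mix_feats)
        if ref_feats is not None:
            # AEC: add frame-aligned playback-reference information
            h = h + self.ref_proj(ref_feats)
        if noise_feats is not None:
            # Enhancement: summarize the noise context over time, then broadcast
            h = h + self.noise_proj(noise_feats).mean(dim=1, keepdim=True)
        if spk_emb is not None:
            # Separation: condition every frame on the target speaker's embedding
            h = h + self.spk_proj(spk_emb).unsqueeze(1)
        h, _ = self.encoder(h)
        mask = torch.sigmoid(self.mask_head(h))  # bounded mask in [0, 1]
        return mix_feats * mask                  # enhanced features passed to ASR
```

Under this sketch, the same trained model instance covers all three tasks at inference time depending on which side inputs are supplied: the reference signal alone for echo cancellation, the noise context for enhancement, the speaker embedding for target-speaker separation, or any combination of them.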