It is challenging to improve automatic speech recognition (ASR) performance in noisy conditions with single-channel speech enhancement (SE). In this paper, we investigate the causes of ASR performance degradation by decomposing the SE errors using orthogonal projection-based decomposition (OPD). OPD decomposes the SE errors into noise and artifact components. The artifact component is defined as the SE error signal that cannot be represented as a linear combination of speech and noise sources. We propose manually scaling the error components to analyze their impact on ASR. We experimentally identify the artifact component as the main cause of performance degradation, and we find that mitigating the artifact can greatly improve ASR performance. Furthermore, we demonstrate that the simple observation adding (OA) technique (i.e., adding a scaled version of the observed signal to the enhanced speech) can monotonically increase the signal-to-artifact ratio under a mild condition. Accordingly, we experimentally confirm that OA improves ASR performance for both simulated and real recordings. The findings of this paper provide a better understanding of the influence of SE errors on ASR and open the door to future research on novel approaches for designing effective single-channel SE front-ends for ASR.
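To make the decomposition and the OA step concrete, here is a minimal NumPy sketch rather than the paper's implementation. It assumes a simplified OPD that projects the enhanced signal onto the span of the clean speech and noise signals only (no time-shifted copies), and it assumes a BSS-eval-style signal-to-artifact ratio. The function names `opd`, `observation_adding`, and `sar_db`, the toy signals, and the way the in-span part is split between speech and noise terms are all illustrative assumptions.

```python
import numpy as np


def opd(enhanced, speech, noise):
    """Simplified OPD (assumption: instantaneous projection, no delays).

    Splits `enhanced` into a speech-aligned part, a noise-aligned part,
    and an artifact part that is orthogonal to both reference signals.
    """
    basis = np.stack([speech, noise], axis=1)  # shape (T, 2)
    # Least-squares coefficients of the orthogonal projection onto span{speech, noise}.
    coeffs, *_ = np.linalg.lstsq(basis, enhanced, rcond=None)
    target = coeffs[0] * speech
    noise_err = coeffs[1] * noise
    artifact = enhanced - target - noise_err  # residual outside the span
    return target, noise_err, artifact


def observation_adding(enhanced, observed, alpha):
    """OA: add a scaled copy of the observed (noisy) signal to the enhanced one."""
    return enhanced + alpha * observed


def sar_db(enhanced, speech, noise):
    """Signal-to-artifact ratio in dB (assumed BSS-eval-style definition)."""
    target, noise_err, artifact = opd(enhanced, speech, noise)
    in_span = target + noise_err
    return 10.0 * np.log10(np.sum(in_span**2) / np.sum(artifact**2))


# Toy signals (hypothetical; random noise stands in for real recordings).
rng = np.random.default_rng(0)
T = 16000
speech = rng.standard_normal(T)
noise = 0.5 * rng.standard_normal(T)
observed = speech + noise

# Toy "enhanced" output: attenuated speech, residual noise, and a distortion
# that is not a linear combination of speech and noise.
distortion = 0.1 * rng.standard_normal(T)
enhanced = 0.9 * speech + 0.2 * noise + distortion

enhanced_oa = observation_adding(enhanced, observed, alpha=0.5)

_, _, artifact_before = opd(enhanced, speech, noise)
_, _, artifact_after = opd(enhanced_oa, speech, noise)

sar_before = sar_db(enhanced, speech, noise)
sar_after = sar_db(enhanced_oa, speech, noise)
print(f"SAR before OA: {sar_before:.2f} dB")
print(f"SAR after OA:  {sar_after:.2f} dB")
```

In this sketch the observed signal lies entirely in the span of speech and noise, so OA leaves the artifact component unchanged and only alters the in-span part. That is one way to see why the signal-to-artifact ratio can rise with the OA scale. The exact condition under which the increase is monotonic is the one given in the paper and is not reproduced here.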