The performance of automatic speech recognition (ASR) systems degrades drastically under noisy conditions. Explicit distortion modelling (EDM), as a feature compensation step, can enhance ASR systems under such conditions by simulating in-domain noisy speech from its clean counterpart. Yet existing distortion models are either non-trainable or unexplainable, and they often lack controllability and generalization ability. In this paper, we propose a fully explainable and controllable model, DENT-DDSP, to achieve EDM. DENT-DDSP utilizes novel differentiable digital signal processing (DDSP) components and requires only 10 seconds of training data to achieve high fidelity. Experiments show that the noisy data simulated by DENT-DDSP achieves the highest simulation fidelity among all baseline models in terms of multi-scale spectral loss (MSSL). Moreover, to validate whether the data simulated by DENT-DDSP can replace the scarce in-domain noisy data in noise-robust ASR tasks, several downstream ASR models with the same architecture are trained on the simulated data and on the real data. Experiments show that the model trained on the simulated noisy data from DENT-DDSP achieves performance comparable to the benchmark, with a 2.7\% difference in word error rate (WER). The code of the model is released online.
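As context for the fidelity metric named above, a minimal sketch of a multi-scale spectral loss (MSSL) in the style popularized by DDSP is given below: an L1 distance between magnitude spectrograms, plus an L1 distance between log-magnitude spectrograms, summed over several FFT sizes. The FFT sizes, hop ratio, and log-term weight `alpha` here are illustrative assumptions; the exact configuration used in the paper may differ.

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    # Magnitude spectrogram via a framed real FFT with a Hann window.
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * win for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1))

def mssl(x, y, fft_sizes=(2048, 1024, 512, 256, 128, 64), alpha=1.0, eps=1e-7):
    # Multi-scale spectral loss: linear + log magnitude L1 over several FFT sizes.
    # fft_sizes, hop = n_fft // 4, and alpha are assumed defaults, not the paper's.
    loss = 0.0
    for n_fft in fft_sizes:
        X = stft_mag(x, n_fft, n_fft // 4)
        Y = stft_mag(y, n_fft, n_fft // 4)
        loss += np.mean(np.abs(X - Y))
        loss += alpha * np.mean(np.abs(np.log(X + eps) - np.log(Y + eps)))
    return loss
```

Comparing spectrograms at multiple resolutions trades off time and frequency localization, which makes the loss a robust measure of how closely simulated noisy speech matches the real recording.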