Recent deep neural network (DNN) models exhibit impressive denoising performance in the time-frequency (T-F) magnitude domain. However, the phase is also a critical component of the speech signal, and it is easily overlooked. In this paper, we propose a multi-branch dilated convolutional network (DCN) that simultaneously enhances the magnitude and phase of noisy speech. Using a multi-objective learning framework with complex-spectrum and ideal ratio mask (IRM) targets, we obtain a causal and robust monaural speech enhancement system. During joint learning, the intermediate IRM estimate is used to generate feature attention factors, which let information flow between the two targets. Moreover, the proposed multi-scale dilated convolution gives the DCN model a more efficient temporal modeling capability. Experimental results show that, compared with other state-of-the-art models, the proposed model achieves better speech quality and intelligibility with less computation.
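To make the two learning targets and the causal dilated convolution named in the abstract concrete, the following is a minimal NumPy sketch and not the authors' implementation. The STFT settings (`n_fft=512`, `hop=256`, Hann window), the square-root form of the IRM, and all function names are assumptions chosen for illustration; the abstract does not specify them.

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Minimal STFT (assumed settings): returns a complex array of shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)

def ideal_ratio_mask(clean_spec, noise_spec, eps=1e-8):
    """IRM in its common square-root form (an assumption): sqrt(|S|^2 / (|S|^2 + |N|^2))."""
    s2 = np.abs(clean_spec) ** 2
    n2 = np.abs(noise_spec) ** 2
    return np.sqrt(s2 / (s2 + n2 + eps))

def complex_spectrum_target(clean_spec):
    """Complex-spectrum target: real and imaginary parts stacked on a leading axis."""
    return np.stack([clean_spec.real, clean_spec.imag])

def causal_dilated_conv1d(x, kernel, dilation):
    """Causal dilated 1-D convolution: y[t] = sum_k kernel[k] * x[t - k * dilation].

    The input is zero-padded on the left only, so y[t] never depends on future samples.
    """
    x = np.asarray(x, dtype=float)
    pad = (len(kernel) - 1) * dilation
    padded = np.concatenate([np.zeros(pad), x])
    y = np.zeros(len(x))
    for k, w in enumerate(kernel):
        start = pad - k * dilation
        y += w * padded[start : start + len(x)]
    return y

# Synthetic signals stand in for clean speech and noise.
rng = np.random.default_rng(0)
clean = rng.standard_normal(4096)
noise = 0.5 * rng.standard_normal(4096)

clean_spec, noise_spec = stft(clean), stft(noise)
irm = ideal_ratio_mask(clean_spec, noise_spec)      # magnitude-domain target
target = complex_spectrum_target(clean_spec)        # magnitude-and-phase target

# Running the same kernel at several dilation rates gives a multi-scale temporal view.
multi_scale = [causal_dilated_conv1d(clean, [0.5, 0.3, 0.2], d) for d in (1, 2, 4)]
```

The IRM takes values between 0 and 1 and only rescales the noisy magnitude, whereas the real and imaginary parts of the complex spectrum carry phase as well, which is why a model trained on both can address magnitude and phase together. Left-only padding is what keeps the convolution causal, and larger dilation rates widen the receptive field without adding kernel weights.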