Due to the high computational complexity of modeling more frequency bands, it remains intractable to perform real-time full-band speech enhancement with deep neural networks. Recent studies typically utilize compressed, perceptually motivated features with relatively low frequency resolution to filter the full-band spectrum with one-stage networks, leading to limited speech quality improvement. In this paper, we propose a coordinated sub-band fusion network for full-band speech enhancement, which aims to recover the low (0-8 kHz), middle (8-16 kHz), and high (16-24 kHz) bands in a step-wise manner. Specifically, a dual-stream network is first pretrained to recover the low-band complex spectrum, and two further sub-networks are designed as middle- and high-band noise suppressors operating in the magnitude-only domain. To fully exploit the information flow between bands, we employ a sub-band interaction module that provides external knowledge guidance across different frequency bands. Extensive experiments show that the proposed method yields consistent performance advantages over state-of-the-art full-band baselines.
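The sketch below is a minimal structural illustration of the step-wise sub-band recovery described above, not the authors' implementation. It assumes a 48 kHz signal whose STFT bins are split at 8 and 16 kHz, and all module names and internals (`SubBandInteraction`, the GRU stand-ins for the dual-stream low-band network and the magnitude-only suppressors) are hypothetical placeholders used only to show how the recovered lower band can guide the next band.

```python
import torch
import torch.nn as nn


class SubBandInteraction(nn.Module):
    """Hypothetical module: projects features from an already-recovered band
    to guide enhancement of the next, higher band."""

    def __init__(self, in_bins: int, out_bins: int):
        super().__init__()
        self.proj = nn.Linear(in_bins, out_bins)

    def forward(self, guidance: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # Fuse external knowledge from the lower band into the target band.
        return target + self.proj(guidance)


class CoordinatedSubBandFusion(nn.Module):
    """Sketch of the three-stage, step-wise recovery (assumed bin counts for a
    960-point FFT at 48 kHz: 161 low-band, 160 middle-band, 160 high-band bins)."""

    def __init__(self, low_bins: int = 161, mid_bins: int = 160, high_bins: int = 160):
        super().__init__()
        # Stage 1: stand-in for the pretrained dual-stream network that
        # recovers the low-band complex spectrum (real + imaginary parts).
        self.low_net = nn.GRU(2 * low_bins, 2 * low_bins, batch_first=True)
        # Stages 2-3: stand-ins for the magnitude-only noise suppressors.
        self.mid_net = nn.GRU(mid_bins, mid_bins, batch_first=True)
        self.high_net = nn.GRU(high_bins, high_bins, batch_first=True)
        # Sub-band interaction across adjacent bands.
        self.low_to_mid = SubBandInteraction(2 * low_bins, mid_bins)
        self.mid_to_high = SubBandInteraction(mid_bins, high_bins)

    def forward(self, low_complex, mid_mag, high_mag):
        # low_complex: (B, T, 2*low_bins); mid_mag, high_mag: (B, T, bins)
        low_out, _ = self.low_net(low_complex)           # recover low band first
        mid_in = self.low_to_mid(low_out, mid_mag)       # guided by low band
        mid_out, _ = self.mid_net(mid_in)
        high_in = self.mid_to_high(mid_out, high_mag)    # guided by middle band
        high_out, _ = self.high_net(high_in)
        return low_out, mid_out, high_out


if __name__ == "__main__":
    model = CoordinatedSubBandFusion()
    low = torch.randn(1, 100, 2 * 161)
    mid = torch.randn(1, 100, 160)
    high = torch.randn(1, 100, 160)
    outs = model(low, mid, high)
    print([o.shape for o in outs])
```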