Full-band speech enhancement based on deep neural networks remains challenging because modeling more frequency bands is difficult and computationally expensive. Previous studies usually adopt compressed full-band speech features on the Bark or ERB scale, whose relatively low frequency resolution degrades performance, especially in the high-frequency region. In this paper, we propose a decoupling-style multi-band fusion model to perform full-band speech denoising and dereverberation. Instead of optimizing the full-band speech with a single network structure, we decompose the full-band target into multiple sub-band speech features and then employ a multi-stage chain optimization strategy to estimate the clean spectrum stage by stage. Specifically, the low- (0-8 kHz), middle- (8-16 kHz), and high-frequency (16-24 kHz) regions are mapped by three separate sub-networks and are then fused to obtain the full-band clean target STFT spectrum. Comprehensive experiments on two public datasets demonstrate that the proposed method outperforms previous advanced systems and yields promising performance in terms of speech quality and intelligibility in real complex scenarios.
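The band decomposition and fusion described above can be illustrated with a minimal sketch. It is not the authors' implementation: the 48 kHz sampling rate is inferred from the 24 kHz upper band edge, the FFT size of 960 is a hypothetical choice, and `placeholder_subnet` is a stand-in that simply passes its input through where a trained sub-network would go. Passing earlier estimates to later stages is one plausible reading of the stage-by-stage chain, which the abstract does not specify in detail.

```python
import numpy as np

SAMPLE_RATE = 48_000   # assumed: 24 kHz Nyquist matches the 16-24 kHz band
N_FFT = 960            # hypothetical FFT size, gives 50 Hz bin spacing
BAND_EDGES_HZ = [0, 8_000, 16_000, 24_000]  # band limits from the abstract


def band_slices(n_fft=N_FFT, sample_rate=SAMPLE_RATE, edges_hz=BAND_EDGES_HZ):
    """Return one slice of STFT bin indices per frequency band."""
    n_bins = n_fft // 2 + 1
    hz_per_bin = sample_rate / n_fft
    bounds = [int(round(edge / hz_per_bin)) for edge in edges_hz]
    bounds[-1] = n_bins  # include the Nyquist bin in the last band
    return [slice(lo, hi) for lo, hi in zip(bounds[:-1], bounds[1:])]


def placeholder_subnet(noisy_band, context):
    """Stand-in for a trained sub-network: returns the noisy band unchanged."""
    return noisy_band


def enhance_full_band(noisy_spec, subnets):
    """Estimate bands one after another, then concatenate them along frequency."""
    estimates = []
    for band, subnet in zip(band_slices(), subnets):
        # Each stage sees its own noisy band plus all earlier estimates.
        estimates.append(subnet(noisy_spec[band], list(estimates)))
    return np.concatenate(estimates, axis=0)


# Random complex spectrogram: 481 frequency bins x 10 frames (illustrative data).
rng = np.random.default_rng(0)
noisy = rng.standard_normal((481, 10)) + 1j * rng.standard_normal((481, 10))
fused = enhance_full_band(noisy, [placeholder_subnet] * 3)
print([(s.start, s.stop) for s in band_slices()])
print(fused.shape)
```

With these assumed settings, each band keeps the full 50 Hz STFT resolution rather than a compressed Bark or ERB representation, and the fused output has the same shape as the full-band input.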