Full-band speech enhancement based on deep neural networks remains challenging because of the difficulty of modeling a larger number of frequency bands. Previous studies usually adopt compressed full-band speech features on the Bark or ERB scale, whose relatively low frequency resolution leads to degraded performance, especially in the high-frequency region. In this paper, we propose a decoupling-style multi-band fusion model to perform full-band speech denoising and dereverberation. Instead of optimizing the full-band speech with a single network structure, we decompose the full-band target into multiple sub-bands and then employ a multi-stage chain optimization strategy to estimate the clean spectrum stage by stage. Specifically, the low- (0-8 kHz), middle- (8-16 kHz), and high-frequency (16-24 kHz) regions are mapped by three separate sub-networks and are then fused to obtain the full-band clean target STFT spectrum. Comprehensive experiments on two public datasets demonstrate that the proposed method outperforms previous advanced systems and yields promising performance in terms of speech quality and intelligibility in real complex scenarios.
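The band decomposition and fusion described above can be illustrated with a minimal sketch. It is not the proposed model: the 48 kHz sampling rate is inferred from the 24 kHz upper band edge, the FFT size of 960 is a hypothetical choice, and `identity_net` is a placeholder standing in for a trained sub-network. Passing earlier band estimates to later stages is one assumed reading of the stage-by-stage chain strategy.

```python
import numpy as np

SAMPLE_RATE = 48_000  # assumed: a 24 kHz upper edge implies 48 kHz sampling
N_FFT = 960           # hypothetical FFT size (50 Hz per bin)
BAND_EDGES_HZ = (0, 8_000, 16_000, 24_000)  # band edges from the abstract


def band_slices(n_fft=N_FFT, sample_rate=SAMPLE_RATE, edges_hz=BAND_EDGES_HZ):
    """Return one slice of STFT bin indices per frequency band."""
    n_bins = n_fft // 2 + 1
    hz_per_bin = sample_rate / n_fft
    # Convert each edge to a bin index; the last band keeps the Nyquist bin.
    idx = [int(round(edge / hz_per_bin)) for edge in edges_hz]
    idx[-1] = n_bins
    return [slice(lo, hi) for lo, hi in zip(idx[:-1], idx[1:])]


def split_bands(spec, slices):
    """Split a (n_bins, n_frames) complex spectrum into per-band blocks."""
    return [spec[s, :] for s in slices]


def fuse_bands(bands):
    """Concatenate per-band estimates back into a full-band spectrum."""
    return np.concatenate(bands, axis=0)


def chain_enhance(noisy_spec, sub_networks, slices):
    """Run the sub-networks stage by stage, one band per stage.

    Each sub-network is a callable taking (noisy_band, previous_estimates),
    so a later stage can condition on the bands estimated before it.
    """
    estimates = []
    for net, noisy_band in zip(sub_networks, split_bands(noisy_spec, slices)):
        estimates.append(net(noisy_band, list(estimates)))
    return fuse_bands(estimates)


def identity_net(noisy_band, previous_estimates):
    """Placeholder for a trained sub-network; returns its input unchanged."""
    return noisy_band


# Usage: a random complex spectrum with 481 bins and 10 frames.
rng = np.random.default_rng(0)
slices = band_slices()
noisy = rng.standard_normal((481, 10)) + 1j * rng.standard_normal((481, 10))
enhanced = chain_enhance(noisy, [identity_net] * 3, slices)
```

With these assumed settings the three bands cover bins 0-159, 160-319, and 320-480, and the fused output has the same shape as the input spectrum.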