Recent studies in neural network-based monaural speech separation (SS) have achieved remarkable success thanks to the growing capacity for long-sequence modeling. However, these systems degrade significantly under realistic noisy conditions, as background noise can be mistaken for a speaker's speech and thus contaminate the separated sources. To alleviate this problem, we propose a novel network that unifies speech enhancement and separation with gradient modulation to improve noise robustness. Specifically, we first build a unified network by combining speech enhancement (SE) and separation modules, optimized with multi-task learning, where SE is supervised by the parallel clean mixture to reduce noise for the downstream speech separation. Furthermore, to avoid suppressing valid speaker information while reducing noise, we propose a gradient modulation (GM) strategy that harmonizes the SE and SS tasks from an optimization perspective. Experimental results show that our approach achieves state-of-the-art performance on the large-scale Libri2Mix-noisy and Libri3Mix-noisy datasets, with SI-SNRi results of 16.0 dB and 15.8 dB respectively. Our code is available on GitHub.
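The abstract does not spell out the exact GM rule, but a common way to "harmonize" two task gradients is to remove the conflicting component when their inner product is negative (a PCGrad-style projection). The sketch below is only an illustration under that assumption; the function name and the combination rule are hypothetical, not the paper's verified method.

```python
import numpy as np

def modulate_gradients(g_se, g_ss):
    """Hypothetical gradient-modulation step (PCGrad-style projection).

    If the SE and SS gradients conflict (negative inner product), project
    the SE gradient onto the normal plane of the SS gradient so that noise
    reduction no longer cancels directions useful for separation. The
    paper's actual GM strategy may differ; this is a minimal sketch.
    """
    dot = float(np.dot(g_se, g_ss))
    if dot < 0:  # tasks conflict: strip the component of g_se opposing g_ss
        g_se = g_se - dot / (float(np.dot(g_ss, g_ss)) + 1e-12) * g_ss
    return g_se + g_ss  # combined update direction for the shared backbone

# Conflicting example: dot([1,-2],[1,1]) = -1 < 0, so g_se is projected
g = modulate_gradients(np.array([1.0, -2.0]), np.array([1.0, 1.0]))
```

After the projection, the modulated SE gradient is orthogonal to the SS gradient, so the combined update cannot actively undo the separation objective.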