This paper proposes a full-band and sub-band fusion model, named FullSubNet, for single-channel real-time speech enhancement. Here, full-band and sub-band refer to models that take full-band and sub-band noisy spectral features as input and output full-band and sub-band speech targets, respectively. The sub-band model processes each frequency independently. Its input consists of one frequency and several context frequencies, and its output is the prediction of the clean speech target for the corresponding frequency. These two types of models have distinct characteristics. The full-band model can capture the global spectral context and long-distance cross-band dependencies, but it lacks the ability to model signal stationarity and to attend to local spectral patterns. The sub-band model has the opposite strengths and weaknesses. In the proposed FullSubNet, we connect a pure full-band model and a pure sub-band model sequentially and use practical joint training to integrate the advantages of both. We conducted experiments on the DNS Challenge (INTERSPEECH 2020) dataset to evaluate the proposed method. Experimental results show that full-band and sub-band information are complementary and that FullSubNet can effectively integrate them. Moreover, FullSubNet outperforms the top-ranked methods in the DNS Challenge (INTERSPEECH 2020).
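The sub-band input described above (one frequency plus several context frequencies) can be illustrated with a minimal sketch. The function names `subband_units` and `fuse_with_fullband`, the parameter `n_context`, and the wrap-around handling at the band edges are all assumptions made for illustration. The abstract does not say how the full-band output is passed to the sub-band model, so the concatenation shown here is only one hypothetical choice.

```python
from typing import List


def subband_units(frame: List[float], n_context: int) -> List[List[float]]:
    """For each frequency bin of one spectral frame, gather that bin
    together with n_context neighbouring bins on each side."""
    num_bins = len(frame)
    units = []
    for f in range(num_bins):
        # Assumption: indices wrap around at the band edges.
        unit = [frame[(f + k) % num_bins]
                for k in range(-n_context, n_context + 1)]
        units.append(unit)
    return units


def fuse_with_fullband(frame: List[float],
                       fullband_out: List[float],
                       n_context: int) -> List[List[float]]:
    """Hypothetical sequential fusion: append the full-band model's output
    for bin f to the sub-band unit centred on bin f."""
    if len(fullband_out) != len(frame):
        raise ValueError("full-band output must have one value per bin")
    units = subband_units(frame, n_context)
    return [unit + [fullband_out[f]] for f, unit in enumerate(units)]


if __name__ == "__main__":
    noisy_frame = [1.0, 2.0, 3.0, 4.0]      # toy magnitude spectrum, 4 bins
    fullband = [0.1, 0.2, 0.3, 0.4]         # stand-in for a full-band model output
    for unit in fuse_with_fullband(noisy_frame, fullband, n_context=1):
        print(unit)
```

Each resulting unit would be processed independently by the sub-band model, which then sees both the local pattern around its frequency and a summary of the global context.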