Time-domain speech enhancement (SE) has been studied intensively in recent years. Among recent works, DEMUCS introduces a multi-resolution STFT loss to improve performance. However, some of the resolutions used for the STFT contain non-stationary signals, and it is difficult to learn multi-resolution frequency losses simultaneously from a single output. To make better use of multi-resolution frequency information, we feed multiple spectrograms with different frame lengths into the time-domain encoders as supplementary input. These spectrograms provide stationary frequency information in both narrowband and wideband. We also adopt multiple decoder outputs, each of which is used to compute the frequency loss at its corresponding resolution. Experimental results show that (1) fusing stationary frequency features in the encoder is more effective than fusing non-stationary features, and (2) multiple outputs that are consistent with the frequency losses improve performance. On the Voice-Bank dataset, the proposed method achieves a PESQ improvement of 0.14.
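The difference between a single output trained on all resolutions and one output per resolution can be sketched as follows. This is a minimal NumPy illustration, not the implementation used in the experiments. The abstract does not define the loss terms, so the sketch assumes the commonly used spectral-convergence plus log-magnitude form. The function names, frame lengths, and hop sizes are hypothetical.

```python
import numpy as np


def stft_mag(x, frame_len, hop):
    """Magnitude spectrogram with a Hann window, shape (frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


def stft_loss(estimate, target, frame_len, hop, eps=1e-8):
    """Frequency loss at one resolution (assumed form: spectral convergence
    plus mean absolute log-magnitude difference)."""
    est = stft_mag(estimate, frame_len, hop)
    ref = stft_mag(target, frame_len, hop)
    spectral_convergence = np.linalg.norm(ref - est) / (np.linalg.norm(ref) + eps)
    log_magnitude = np.mean(np.abs(np.log(ref + eps) - np.log(est + eps)))
    return spectral_convergence + log_magnitude


def single_output_loss(output, target, resolutions):
    """One output is scored against every resolution at once."""
    losses = [stft_loss(output, target, fl, hop) for fl, hop in resolutions]
    return sum(losses) / len(losses)


def per_output_loss(outputs, target, resolutions):
    """Output k is scored only at resolution k, one output per resolution."""
    if len(outputs) != len(resolutions):
        raise ValueError("need exactly one output per resolution")
    losses = [
        stft_loss(out, target, fl, hop)
        for out, (fl, hop) in zip(outputs, resolutions)
    ]
    return sum(losses) / len(losses)


# Illustrative (frame length, hop) pairs; these values are assumptions.
resolutions = [(256, 64), (512, 128), (1024, 256)]

rng = np.random.default_rng(0)
clean = rng.standard_normal(4000)
# Stand-ins for three decoder outputs: the clean signal plus small errors.
decoder_outputs = [clean + 0.1 * rng.standard_normal(4000) for _ in resolutions]

loss = per_output_loss(decoder_outputs, clean, resolutions)
```

In a trainable model the same pairing would be expressed with a differentiable STFT, and how the per-resolution losses are weighted or combined with a time-domain loss is a separate design choice that this sketch does not address.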