Complex-valued processing has brought deep learning-based speech enhancement and signal extraction to a new level. Typically, the process is based on a time-frequency (TF) mask which is applied to a noisy spectrogram, while complex masks (CMs) are usually preferred over real-valued masks due to their ability to modify the phase. Recent work proposed to use a complex filter instead of a point-wise multiplication with a mask. This allows the model to incorporate information from previous and future time steps, exploiting local correlations within each frequency band. In this work, we propose DeepFilterNet, a two-stage speech enhancement framework utilizing deep filtering. First, we enhance the spectral envelope using ERB-scaled gains modeling the human frequency perception. The second stage employs deep filtering to enhance the periodic components of speech. In addition to taking advantage of perceptual properties of speech, we enforce network sparsity via separable convolutions and extensive grouping in linear and recurrent layers to design a low-complexity architecture. We further show that our two-stage deep filtering approach outperforms complex masks over a variety of frequency resolutions and latencies, and demonstrates convincing performance compared to other state-of-the-art models.
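To illustrate the contrast between a point-wise complex mask and deep filtering described above, the following NumPy sketch applies a per-bin complex filter over neighbouring time frames within each frequency band. This is a minimal sketch under assumed conventions: the array shapes, the variable names (X, M, C), the filter order N, and the look_ahead parameter are illustrative and not taken from the paper.

```python
import numpy as np

def apply_complex_mask(X, M):
    """Point-wise complex mask (CM): one complex gain per TF bin.
    X and M both have shape (T, F) with complex entries."""
    return M * X

def apply_deep_filter(X, C, look_ahead=0):
    """Deep filtering sketch: each enhanced TF bin is a complex linear
    combination of N neighbouring time steps within the same frequency band.
    X: noisy STFT, shape (T, F), complex.
    C: per-bin filter coefficients, shape (T, F, N), complex.
    look_ahead: number of future frames included in the filter (assumption)."""
    T, F = X.shape
    N = C.shape[-1]
    # Pad along the time axis so every frame has N context frames available.
    X_pad = np.pad(X, ((N - 1 - look_ahead, look_ahead), (0, 0)), mode="constant")
    Y = np.zeros_like(X)
    for i in range(N):
        # Shifted copies of X realise the sum over previous (and future) frames.
        Y += C[..., i] * X_pad[i:i + T, :]
    return Y
```

With N = 1 and look_ahead = 0 the deep filter reduces to the point-wise complex mask, which is one way to see why the filter is a strict generalization that can exploit local temporal correlations per frequency band.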