Deep learning based speech enhancement in the short-time Fourier transform (STFT) domain typically uses a large window length such as 32 ms. A longer window covers more samples and offers finer frequency resolution, which can lead to better enhancement. This, however, incurs an algorithmic latency of 32 ms in an online setup, because the overlap-add algorithm used in the inverse STFT (iSTFT) is also performed with the same 32 ms window size. To reduce this inherent latency, we adapt a conventional dual window size approach, where a regular input window size is used for the STFT but a shorter output window is used for the overlap-add in the iSTFT, to STFT-domain deep learning based frame-online speech enhancement. Based on this STFT and iSTFT configuration, we employ single- or multi-microphone complex spectral mapping for frame-online enhancement, where a deep neural network (DNN) is trained to predict the real and imaginary (RI) components of target speech from the mixture RI components. In addition, we use the RI components predicted by the DNN to conduct frame-online beamforming, the results of which are then used as extra features for a second DNN that performs frame-online post-filtering. The frequency-domain beamforming between the two DNNs can be easily integrated with complex spectral mapping and is designed not to incur any algorithmic latency. We further propose a future-frame prediction technique to reduce the algorithmic latency even more. Evaluation results on a noisy-reverberant speech enhancement task demonstrate the effectiveness of the proposed algorithms. Compared with Conv-TasNet, our STFT-domain system can achieve better enhancement performance for a comparable amount of computation, or comparable performance with less computation, while maintaining strong performance at an algorithmic latency as low as 2 ms.
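The latency arithmetic described above can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the 16 kHz sampling rate, the 4 ms output window and the 2 ms hop are all hypothetical choices made for illustration. The rule that each predicted future frame removes one hop of latency is likewise an assumption, not something stated in the abstract.

```python
def algorithmic_latency_ms(output_window_ms: float,
                           hop_ms: float = 0.0,
                           future_frames: int = 0) -> float:
    """Latency attributable to iSTFT overlap-add buffering, in milliseconds.

    The output (synthesis) window length sets the latency, as the abstract
    states for the 32 ms case. ASSUMPTION: each predicted future frame
    shortens the latency by one hop.
    """
    latency = output_window_ms - future_frames * hop_ms
    if latency < 0:
        raise ValueError("predicted frames exceed the output window span")
    return latency


def ms_to_samples(duration_ms: float, sample_rate_hz: int = 16000) -> int:
    """Convert a duration to a sample count (16 kHz is an assumed rate)."""
    return round(duration_ms * sample_rate_hz / 1000)


# Regular setup: the same 32 ms window is used for STFT and iSTFT.
regular = algorithmic_latency_ms(output_window_ms=32.0)

# Dual window setup: 32 ms input window, hypothetical 4 ms output window.
dual = algorithmic_latency_ms(output_window_ms=4.0)

# Dual window plus one predicted future frame with a hypothetical 2 ms hop.
dual_future = algorithmic_latency_ms(output_window_ms=4.0,
                                     hop_ms=2.0,
                                     future_frames=1)

print(regular, dual, dual_future)
print(ms_to_samples(32.0), ms_to_samples(4.0))
```

Under these assumed values, the input window keeps its 32 ms span for frequency resolution, while the latency is governed only by the shorter output window and the number of predicted frames.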