Deep-learning-based speech enhancement in the short-time Fourier transform (STFT) domain typically uses a large window length such as 32 ms. A larger window can lead to higher frequency resolution and potentially better enhancement. This, however, incurs an algorithmic latency of 32 ms in an online setup, because the overlap-add algorithm used in the inverse STFT (iSTFT) is also performed using the same window size. To reduce this inherent latency, we adapt a conventional dual-window-size approach, where a regular input window size is used for STFT but a shorter output window is used for overlap-add, for STFT-domain deep-learning-based frame-online speech enhancement. Based on this STFT-iSTFT configuration, we employ complex spectral mapping for frame-online enhancement, where a deep neural network (DNN) is trained to predict the real and imaginary (RI) components of target speech from the mixture RI components. In addition, we use the DNN-predicted RI components to conduct frame-online beamforming, the results of which are used as extra features for a second DNN to perform frame-online post-filtering. The frequency-domain beamformer can be easily integrated with our DNNs and is designed not to incur any algorithmic latency. Additionally, we propose a future-frame prediction technique to further reduce the algorithmic latency. Evaluation on noisy-reverberant speech enhancement shows the effectiveness of the proposed algorithms. Compared with Conv-TasNet, our STFT-domain system can achieve better enhancement performance for a comparable amount of computation, or comparable performance with less computation, while maintaining strong performance at an algorithmic latency as low as 2 ms.
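To make the dual-window-size idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It analyzes each frame with a long input window but overlap-adds only the last few samples of each resynthesized frame with a short output window. Everything specific here is an assumption made for illustration: the function names, the 16 kHz sampling rate implied by the defaults (512 samples = 32 ms, 64 samples = 4 ms), the Hamming analysis window, and the choice of a synthesis window equal to a periodic Hann window divided by the tail of the analysis window, so that overlapped copies at a hop of half the output window sum to one. The spectrum is passed through unmodified where a DNN would normally act.

```python
import numpy as np


def algorithmic_latency_ms(out_win, sample_rate):
    """Latency of the overlap-add stage: an output sample is final only once
    the last frame covering it has been seen, i.e. up to out_win samples later."""
    return 1000.0 * out_win / sample_rate


def dual_window_stft_istft(x, in_win=512, out_win=64, hop=32):
    """Illustrative dual-window STFT -> iSTFT round trip (hypothetical helper).

    Assumes hop == out_win // 2 so that the periodic Hann window below
    satisfies the constant-overlap-add condition.
    """
    assert hop * 2 == out_win, "this sketch assumes 50% overlap of output windows"

    # Long analysis window (assumed Hamming; its samples are never zero).
    analysis = np.hamming(in_win)

    # Periodic Hann of length out_win: copies shifted by out_win / 2 sum to 1.
    n = np.arange(out_win)
    cola = 0.5 - 0.5 * np.cos(2.0 * np.pi * n / out_win)

    # Short synthesis window that undoes the analysis window on the frame tail.
    synthesis = cola / analysis[-out_win:]

    n_frames = 1 + (len(x) - in_win) // hop
    y = np.zeros(len(x))
    for t in range(n_frames):
        start = t * hop
        frame = x[start:start + in_win] * analysis
        spec = np.fft.rfft(frame)           # a DNN would enhance `spec` here
        rec = np.fft.irfft(spec, n=in_win)  # back to the time domain
        # Overlap-add only the last out_win samples of the frame.
        tail = slice(start + in_win - out_win, start + in_win)
        y[tail] += rec[-out_win:] * synthesis
    return y
```

With these assumed defaults, `algorithmic_latency_ms(512, 16000)` gives 32 ms for the single-window case and `algorithmic_latency_ms(64, 16000)` gives 4 ms for the short output window. Away from the signal edges, the output of `dual_window_stft_istft` matches its input when the spectrum is left untouched, which is the property that lets the output window shrink without changing the analysis resolution.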