Frame-online speech enhancement systems operating in the short-time Fourier transform (STFT) domain usually incur an algorithmic latency equal to the window size, due to the overlap-add operation in the inverse STFT (iSTFT). This latency allows enhancement models to leverage future contextual information up to one window in length. However, current frame-online systems exploit this information only partially. To fully exploit it, we propose an overlapped-frame prediction technique for deep-learning-based frame-online speech enhancement: at each frame, our deep neural network (DNN) predicts not only the current frame but also the several past frames necessary for overlap-add, instead of predicting the current frame alone. In addition, we propose a loss function that accounts for the scale difference between predicted and oracle target signals. Experiments on a noisy-reverberant speech enhancement task show the effectiveness of the proposed algorithms.
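The latency claim above can be illustrated with a small numeric sketch. This is not the paper's implementation, only an assumed toy setup (window size 4, hop 2) showing that under overlap-add synthesis an output sample keeps changing until the last window covering it has been processed, so the lookahead needed is on the order of the window size, not the hop size:

```python
import numpy as np

# Toy parameters (assumptions for illustration only).
win = 4          # window size N, in samples
hop = 2          # hop size H, in samples
num_frames = 6

# finalized_at[n] = index of the last input sample that must be
# available before output sample n stops changing under overlap-add.
out_len = (num_frames - 1) * hop + win
finalized_at = np.zeros(out_len, dtype=int)
for k in range(num_frames):
    start = k * hop
    # Frame k requires input samples [start, start + win); any output
    # sample it covers is not final until this frame has been added.
    finalized_at[start:start + win] = start + win - 1

# Worst-case wait between an output sample and the input it depends on.
latency = max(finalized_at[n] - n for n in range(out_len))
print(latency)  # prints 3, i.e. win - 1: roughly one window of lookahead
```

Predicting the past frames needed for overlap-add at each step, as proposed above, lets the network use this one-window lookahead without adding latency beyond what the iSTFT already imposes.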