Contemporary speech enhancement predominantly relies on audio transforms that are trained to reconstruct a clean speech waveform. The development of high-performing neural network sound recognition systems has raised the possibility of using deep feature representations as 'perceptual' losses with which to train denoising systems. We explored their utility by first training deep neural networks to classify either spoken words or environmental sounds from audio. We then trained an audio transform to map noisy speech to an audio waveform that minimized the difference in the deep feature representations between the output audio and the corresponding clean audio. The resulting transforms removed noise substantially better than baseline methods trained to reconstruct clean waveforms, and also outperformed previous methods using deep feature losses. However, a similar benefit was obtained simply by using losses derived from the filter bank inputs to the deep networks. The results show that deep features can guide speech enhancement, but suggest that they do not yet outperform simple alternatives that do not involve learned features.
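The training objective described above can be sketched as a deep feature ("perceptual") loss: activations of a frozen recognition network, computed on both the enhanced and the clean audio, are compared layer by layer. The two-layer network below is a hypothetical stand-in for a pretrained word/sound classifier (its random weights are an assumption for illustration only); in practice the weights would come from the trained recognition model.

```python
import numpy as np

# Frozen weights of a hypothetical pretrained recognition network.
# In the actual setup these would be learned by classifying spoken
# words or environmental sounds; random values are used here only
# to make the sketch self-contained.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((64, 256)) * 0.1  # layer-1 weights (frozen)
W2 = rng.standard_normal((32, 64)) * 0.1   # layer-2 weights (frozen)

def deep_features(x):
    """Return the hidden activations of each layer of the frozen network."""
    h1 = np.maximum(0.0, W1 @ x)  # ReLU hidden layer 1
    h2 = np.maximum(0.0, W2 @ h1)  # ReLU hidden layer 2
    return [h1, h2]

def deep_feature_loss(enhanced, clean):
    """Mean absolute distance between deep features of the enhanced and
    clean audio, summed over layers -- the signal used to train the
    enhancement transform."""
    fe, fc = deep_features(enhanced), deep_features(clean)
    return sum(np.abs(a - b).mean() for a, b in zip(fe, fc))

# Toy 256-sample "waveforms": the loss is zero when the enhanced output
# matches the clean audio exactly, and positive otherwise.
clean = rng.standard_normal(256)
noisy = clean + 0.3 * rng.standard_normal(256)
print(deep_feature_loss(clean, clean))        # 0.0
print(deep_feature_loss(noisy, clean) > 0.0)  # True
```

Replacing `deep_features` with the network's filterbank (input) representation gives the simpler alternative loss that the abstract reports performing comparably.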