This paper addresses the heavy dependence of deep-learning-based audio denoising methods on clean speech data by showing that deep speech denoising networks can be trained using only noisy speech samples. Conventional wisdom holds that good speech denoising performance requires large quantities of both noisy speech samples and perfectly clean speech samples, which in turn calls for expensive audio recording equipment and tightly controlled soundproof recording studios. These requirements pose significant challenges for data collection, especially in economically disadvantaged regions and for low-resource languages. This work shows that speech denoising deep neural networks can be trained successfully using only noisy training audio. Furthermore, such training regimes achieve better denoising performance than conventional training regimes that use clean training audio targets in cases involving complex noise distributions and low signal-to-noise ratios (high-noise environments). This is demonstrated through experiments evaluating the proposed approach on both real-world and synthetic noises using the 20-layer Deep Complex U-Net architecture.
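
The intuition behind training on noisy targets can be illustrated with a toy least-squares problem. The sketch below is not the Deep Complex U-Net training described above; it fits a single scalar gain to a synthetic sine wave, and it assumes that the noise in the target is zero-mean and independent of the noise in the input (the abstract does not state these conditions). The function name `fit_gain`, the signal, and the noise levels are all hypothetical choices made for illustration.

```python
import numpy as np


def fit_gain(inputs, targets):
    """Least-squares scalar gain w minimising ||w * inputs - targets||^2."""
    return float(np.dot(inputs, targets) / np.dot(inputs, inputs))


rng = np.random.default_rng(0)
n = 100_000
t = np.arange(n)

# Hypothetical "clean" signal: a unit-amplitude sine wave.
clean = np.sin(2 * np.pi * 0.01 * t)

# Two independent zero-mean Gaussian noise realisations (assumed noise model).
noisy_input = clean + rng.normal(0.0, 0.5, n)
noisy_target = clean + rng.normal(0.0, 0.5, n)

# Conventional setup: noisy input, clean target.
w_clean = fit_gain(noisy_input, clean)

# Noisy-target setup: noisy input, independently noisy target.
w_noisy = fit_gain(noisy_input, noisy_target)

print(f"gain fitted with clean targets: {w_clean:.4f}")
print(f"gain fitted with noisy targets: {w_noisy:.4f}")
```

Under these assumptions the two fitted gains come out nearly equal, because the target noise is uncorrelated with the input and averages out of the squared-error objective. Whether the same holds for a deep network and real-world noise is what the experiments in the paper set out to test.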