Deep neural network (DNN)-based speech enhancement typically uses clean speech as the training target. However, it is difficult to collect large amounts of clean speech because recording it is costly; as a result, the performance of current speech enhancement methods has been limited by the amount of available training data. To relax this limitation, Noisy-target Training (NyTT), which uses noisy speech as the training target, has been proposed. Although experiments have shown that NyTT can train a DNN without clean speech, a detailed analysis has not been conducted and its behavior is not yet well understood. In this paper, we conduct various analyses to deepen our understanding of NyTT. In addition, based on a property of NyTT, we propose a refined method whose performance is comparable to that of training with clean speech. Furthermore, we show that performance can be improved by combining a large amount of noisy speech with clean speech.
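The abstract does not spell out how NyTT constructs its training pairs, but the idea of "noisy speech as a training target" can be sketched under the commonly described formulation: add freshly sampled noise to already-noisy speech and regress onto the noisy speech itself, so no clean recording is ever needed. The function names and toy signals below are illustrative, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_conventional_pair(clean, noise):
    """Conventional supervised pair: (noisy input, clean target)."""
    return clean + noise, clean


def make_nytt_pair(noisy, extra_noise):
    """NyTT-style pair (assumed formulation): further contaminate
    already-noisy speech and use the noisy speech as the target."""
    return noisy + extra_noise, noisy


# Toy 1-second "speech" at 16 kHz: a sinusoid plus recording noise.
t = np.arange(16000) / 16000
clean = np.sin(2 * np.pi * 220 * t)
noisy = clean + 0.1 * rng.standard_normal(16000)  # what NyTT can collect cheaply

extra = 0.1 * rng.standard_normal(16000)  # freshly sampled training-time noise
x, y = make_nytt_pair(noisy, extra)

# The target is the noisy recording itself; clean speech is never required.
assert np.allclose(y, noisy)
```

A usage note: the conventional pair requires `clean`, which is expensive to record, while the NyTT pair is built entirely from `noisy` plus noise that can be sampled or collected in bulk, which is the data-efficiency argument the abstract makes.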