The lack of clean speech is a practical challenge in developing speech enhancement systems: neural network models must be trained in an unsupervised manner, and an inevitable mismatch arises between their training criterion and the evaluation metric. To address this, we propose a teacher-student training strategy that requires no subjective or objective speech quality metric as a learning reference, improving on the previously proposed noisy-target training (NyTT). Because homogeneity between in-domain noise and extraneous noise is key to NyTT's effectiveness, we train various student models by remixing either the teacher model's estimated speech and noise (for clean-target training) or the raw noisy speech and the teacher model's estimated noise (for noisy-target training). We use the NyTT model as the initial teacher model. Experimental results show that our proposed method outperforms several baselines, especially with two-stage inference, where clean speech is derived successively through the bootstrap model and the final student model.
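The following is a minimal sketch, not the paper's implementation, of one way the remixed student training pairs and the two-stage inference described above could be realized for single-channel time-domain signals. The model objects and helper names (`teacher`, `student`, `enhance`, `remix_clean_target`, `remix_noisy_target`) are hypothetical placeholders, and the noise estimate is taken as the simple residual between the noisy input and the teacher's speech estimate.

```python
# Hypothetical sketch of the remixing strategy and two-stage inference.
# Not the authors' code; names and signatures are illustrative assumptions.
import numpy as np


def enhance(model, noisy: np.ndarray) -> np.ndarray:
    """Stand-in for running a trained enhancement model on a noisy waveform."""
    return model(noisy)


def remix_clean_target(teacher, noisy: np.ndarray):
    """Clean-target pair: remix the teacher's estimated speech and noise as the
    student input, with the estimated speech as the (pseudo-clean) target."""
    est_speech = enhance(teacher, noisy)
    est_noise = noisy - est_speech              # residual as the teacher's noise estimate
    return est_speech + est_noise, est_speech   # (student input, student target)


def remix_noisy_target(teacher, noisy: np.ndarray, other_noisy: np.ndarray):
    """Noisy-target pair: add the teacher's noise estimate (here taken from a
    different utterance) to the raw noisy speech, which itself is the target."""
    est_noise = other_noisy - enhance(teacher, other_noisy)
    return noisy + est_noise, noisy             # (student input, student target)


def two_stage_inference(bootstrap, student, noisy: np.ndarray) -> np.ndarray:
    """Two-stage inference: pass the input through the bootstrap model first,
    then refine its output with the final student model."""
    return enhance(student, enhance(bootstrap, noisy))


if __name__ == "__main__":
    # Toy demonstration with identity models and random 1-second signals at 16 kHz.
    rng = np.random.default_rng(0)
    identity = lambda x: x
    noisy_a, noisy_b = rng.standard_normal(16000), rng.standard_normal(16000)
    x_ct, y_ct = remix_clean_target(identity, noisy_a)
    x_nt, y_nt = remix_noisy_target(identity, noisy_a, noisy_b)
    print(x_ct.shape, y_ct.shape, x_nt.shape, y_nt.shape)
    print(two_stage_inference(identity, identity, noisy_a).shape)
```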