Training of speech enhancement systems often does not incorporate knowledge of human perception, and thus can lead to unnatural-sounding results. Incorporating psychoacoustically motivated speech perception metrics into model training via a predictor network has recently gained interest. However, the performance of such predictors is limited by the distribution of metric scores that appear in the training data. In this work, we propose MetricGAN+/-, an extension of MetricGAN+ (one such metric-motivated system) that introduces an additional network, a "de-generator", which attempts to improve the robustness of the prediction network (and, by extension, of the generator) by ensuring that a wider range of metric scores is observed in training. Experimental results on the VoiceBank-DEMAND dataset show a relative improvement in PESQ score of 3.8% (3.05 vs 3.22), as well as better generalisation to unseen noise and speech.