Recent work in speech enhancement has explored the use of self-supervised speech representations to aid the training of neural speech enhancement models. However, much of this work focuses on the deepest or final outputs of self-supervised speech representation models rather than the earlier feature encodings. The use of self-supervised representations in this way is often not fully motivated. In this work, it is shown that the distance between the feature encodings of clean and noisy speech correlates strongly with psychoacoustically motivated measures of speech quality and intelligibility, as well as with human Mean Opinion Score (MOS) ratings. Experiments using this distance as a loss function are performed, and improved performance over an STFT spectrogram distance-based loss, as well as over other common loss functions from the speech enhancement literature, is demonstrated using objective measures such as perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI).
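The idea of using the distance between feature encodings as a training loss can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `ToyFeatureEncoder` is a hypothetical stand-in (a fixed random projection of waveform frames) for the early feature encoder of a pretrained self-supervised speech model, and the mean squared difference is an assumed choice of distance, since the abstract does not specify one.

```python
import numpy as np


def frame_signal(wave, frame_len=400, hop=320):
    """Slice a 1-D waveform into overlapping frames of shape (n_frames, frame_len)."""
    n_frames = 1 + (len(wave) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return wave[idx]


class ToyFeatureEncoder:
    """Hypothetical stand-in for a self-supervised model's feature encoder.

    A real system would use the frozen early (e.g. convolutional) layers of a
    pretrained model; here a fixed random projection plays that role.
    """

    def __init__(self, frame_len=400, hop=320, dim=64, seed=0):
        rng = np.random.default_rng(seed)
        self.frame_len, self.hop = frame_len, hop
        self.weights = rng.standard_normal((frame_len, dim)) / np.sqrt(frame_len)

    def __call__(self, wave):
        frames = frame_signal(wave, self.frame_len, self.hop)
        return np.tanh(frames @ self.weights)  # shape (n_frames, dim)


def feature_encoding_distance(encoder, enhanced, clean):
    """Mean squared distance between feature encodings (assumed distance)."""
    diff = encoder(enhanced) - encoder(clean)
    return float(np.mean(diff ** 2))


# Synthetic one-second example at 16 kHz: a tone plus white noise.
rng = np.random.default_rng(1)
t = np.arange(16000) / 16000.0
clean = 0.5 * np.sin(2 * np.pi * 220.0 * t)
noisy = clean + 0.1 * rng.standard_normal(clean.shape)

encoder = ToyFeatureEncoder()
print("distance(clean, clean):", feature_encoding_distance(encoder, clean, clean))
print("distance(noisy, clean):", feature_encoding_distance(encoder, noisy, clean))
```

In an actual training loop, the encoder would be kept frozen and the loss would be computed between the enhancement model's output and the clean reference inside an automatic differentiation framework, so that gradients flow back to the enhancement model only.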