Speech enhancement (SE) aims to improve speech quality and intelligibility, both of which depend on smooth transitions across speech segments that carry linguistic information, e.g., phones and syllables. In this study, we propose a novel phone-fortified perceptual loss (PFPL) that takes phonetic information into account for training SE models. To effectively incorporate the phonetic information, the PFPL is computed on the latent representations of the wav2vec model, a powerful self-supervised encoder that renders rich phonetic information. To more accurately measure the distributional distances of the latent representations, the PFPL adopts the Wasserstein distance as its distance measure. Our experimental results first reveal that the PFPL correlates more strongly with perceptual evaluation metrics than signal-level losses do. Moreover, the results show that the PFPL enables a deep complex U-Net SE model to achieve highly competitive performance in standardized quality and intelligibility evaluations on the Voice Bank-DEMAND dataset.
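To make the idea concrete, the following is a minimal sketch of a PFPL-style training loss, not the paper's exact formulation: a frozen wav2vec-style `encoder` (a hypothetical stand-in here) maps enhanced and clean waveforms to latent sequences, and the latents are compared with per-dimension empirical 1-D Wasserstein distances as one simple way to measure the distributional distance.

```python
# Hedged sketch of a PFPL-style loss; the paper's exact Wasserstein formulation
# may differ. `encoder` is an assumed frozen wav2vec-like model mapping
# waveforms of shape (batch, samples) to latents of shape (batch, time, dim).

import torch


def wasserstein_1d(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Empirical 1-D Wasserstein-1 distance between two equal-size samples.

    Sorting both samples and averaging the absolute differences is the
    standard closed form in the one-dimensional case.
    """
    x_sorted, _ = torch.sort(x, dim=0)
    y_sorted, _ = torch.sort(y, dim=0)
    return (x_sorted - y_sorted).abs().mean()


def pfpl_style_loss(encoder, enhanced: torch.Tensor, clean: torch.Tensor) -> torch.Tensor:
    """Distance between latent representations of enhanced and clean speech.

    The clean-speech latents are treated as fixed targets (no gradient),
    while gradients flow through the enhanced waveform into the SE model.
    Averaging per-dimension 1-D distances is a simplification chosen for
    illustration, not necessarily the paper's definition.
    """
    with torch.no_grad():
        target = encoder(clean)        # (batch, time, dim) reference latents
    latents = encoder(enhanced)        # (batch, time, dim) latents of SE output
    b, t, d = latents.shape
    latents = latents.reshape(b * t, d)
    target = target.reshape(b * t, d)
    dists = [wasserstein_1d(latents[:, k], target[:, k]) for k in range(d)]
    return torch.stack(dists).mean()
```

In practice such a perceptual term is typically added to a signal-level loss with a weighting factor, and the encoder's parameters are kept frozen so that it serves purely as a phonetically informed feature space.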