Despite rapid advances in recent years, current speech enhancement models often produce speech whose perceptual quality differs from that of real clean speech. We propose a learning objective that formalizes these differences in perceptual quality using domain knowledge of acoustic phonetics. We identify temporal acoustic parameters, such as spectral tilt, spectral flux, and shimmer, that are non-differentiable, and we develop a neural network estimator that can accurately predict their time-series values across an utterance. We also model phoneme-specific weights for each parameter, since the acoustic parameters are known to behave differently across phonemes. This criterion can be added as an auxiliary loss to any model that produces speech, optimizing the output to match clean speech in these parameters. Experimentally, we show that it improves speech enhancement in both the time domain and the time-frequency domain, as measured by standard evaluation metrics. We also analyze the phoneme-dependent improvement on each acoustic parameter, demonstrating the additional interpretability that our method provides; this analysis can suggest which parameters are currently the bottleneck for further improvement.
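To make the structure of the objective concrete, the following is a minimal PyTorch sketch of an auxiliary loss of the kind described above: a frozen neural estimator maps waveforms to per-frame acoustic parameter values, and phoneme-specific weights modulate the per-frame error. All names here (`AcousticParamLoss`, `estimator`, `phoneme_post`, etc.) are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn


class AcousticParamLoss(nn.Module):
    """Auxiliary loss matching enhanced speech to clean speech in
    time-series acoustic parameters, weighted per phoneme."""

    def __init__(self, estimator: nn.Module, num_phonemes: int, num_params: int):
        super().__init__()
        # Pretrained, frozen estimator: waveform (batch, samples) ->
        # (batch, frames, num_params) time series of acoustic parameters
        # (e.g. spectral tilt, spectral flux, shimmer). Gradients still
        # flow through it back to the enhanced waveform.
        self.estimator = estimator.eval()
        for p in self.estimator.parameters():
            p.requires_grad_(False)
        # Learnable phoneme-specific weight for each acoustic parameter.
        self.weights = nn.Parameter(torch.ones(num_phonemes, num_params))

    def forward(self, enhanced, clean, phoneme_post):
        """
        enhanced, clean: (batch, samples) waveforms.
        phoneme_post: (batch, frames, num_phonemes) frame-level phoneme
                      posteriors, e.g. from a forced aligner.
        """
        pred = self.estimator(enhanced)   # (batch, frames, num_params)
        target = self.estimator(clean)    # (batch, frames, num_params)
        # Blend phoneme-specific weights by the frame-level posteriors
        # to get one weight per frame and parameter.
        w = phoneme_post @ self.weights   # (batch, frames, num_params)
        # Weighted L1 distance between predicted and clean parameter tracks.
        return (w * (pred - target).abs()).mean()
```

In use, this term would simply be scaled and added to the enhancement model's existing loss, e.g. `total_loss = base_loss + lam * ap_loss(enhanced, clean, phoneme_post)`, which is what makes it applicable to both time-domain and time-frequency-domain models.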