While deep learning based speech enhancement systems have made rapid progress in improving the quality of speech signals, they can still produce outputs that contain artifacts and sound unnatural. We propose a novel approach to speech enhancement aimed at improving the perceptual quality and naturalness of enhanced signals by optimizing for key characteristics of speech. We first identify key acoustic parameters that have been found to correlate well with voice quality (e.g., jitter, shimmer, and spectral flux) and then propose objective functions aimed at reducing the difference between clean and enhanced speech with respect to these features. The full set of acoustic features is the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS), which includes 25 different attributes associated with the perception of speech. Given the non-differentiable nature of these feature computations, we first build differentiable estimators of the eGeMAPS features and then use them to fine-tune existing speech enhancement systems. Our approach is generic and can be applied to any existing deep learning based enhancement system to further improve the enhanced speech signals. Experimental results on the Deep Noise Suppression (DNS) Challenge dataset show that our approach can improve state-of-the-art deep learning based enhancement systems.
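To make the fine-tuning idea concrete, the following is a minimal sketch (not the authors' code) of how a differentiable proxy for eGeMAPS-style acoustic parameters could be used as an auxiliary loss on top of an existing enhancement objective. The network architecture, dimensions, function names, and the weighting factor `lambda_feat` are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_EGEMAPS_FEATURES = 25  # number of acoustic attributes mentioned in the abstract


class AcousticParamEstimator(nn.Module):
    """Differentiable estimator mapping a magnitude spectrogram
    (batch, freq_bins, frames) to one eGeMAPS-style feature vector per utterance.
    A simple MLP over frames is used here purely for illustration."""

    def __init__(self, freq_bins: int = 257, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(freq_bins, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, NUM_EGEMAPS_FEATURES),
        )

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # per-frame predictions averaged over time -> utterance-level parameters
        frames = spec.transpose(1, 2)          # (batch, frames, freq_bins)
        return self.net(frames).mean(dim=1)    # (batch, NUM_EGEMAPS_FEATURES)


def fine_tune_loss(enhanced_spec, clean_spec, estimator, base_loss_fn, lambda_feat=0.1):
    """Base enhancement loss plus an L1 penalty on the distance between
    estimated acoustic parameters of clean and enhanced speech (assumed form)."""
    base = base_loss_fn(enhanced_spec, clean_spec)
    feat_enh = estimator(enhanced_spec)
    with torch.no_grad():                      # clean-speech targets need no gradient
        feat_clean = estimator(clean_spec)
    feat = F.l1_loss(feat_enh, feat_clean)
    return base + lambda_feat * feat
```

In this sketch the estimator would be trained beforehand to regress the true (non-differentiable) eGeMAPS values, then frozen and used only to propagate gradients back into the enhancement model during fine-tuning.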