Speech enhancement models have progressed greatly in recent years, but the perceptual quality of their speech outputs remains limited. We propose an objective for perceptual quality based on temporal acoustic parameters. These are fundamental speech features that play an essential role in various applications, including speaker recognition and paralinguistic analysis. We provide a differentiable estimator for four categories of low-level acoustic descriptors: frequency-related parameters, energy- or amplitude-related parameters, spectral balance parameters, and temporal features. Unlike prior work that considers aggregated acoustic parameters or only a few categories of acoustic parameters, our temporal acoustic parameter (TAP) loss, TAPLoss, enables auxiliary optimization and improvement of many fine-grained speech characteristics in enhancement workflows. We show that adding TAPLoss as an auxiliary objective in speech enhancement produces speech with improved perceptual quality and intelligibility. Using data from the Deep Noise Suppression 2020 Challenge, we demonstrate that both time-domain and time-frequency domain models benefit from our method.
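To make the idea of an auxiliary, frame-level acoustic objective concrete, the following is a minimal sketch and not the estimator described above. It assumes a single toy parameter (per-frame log energy) standing in for the four descriptor categories, a mean absolute error as the distance, and a hypothetical weight of 0.1; the names `frame_log_energy`, `mean_abs_error`, and `total_loss` are illustrative only. In an actual training setup these operations would be written in an automatic-differentiation framework so that gradients reach the enhancement model.

```python
import math


def frame_log_energy(signal, frame_len=4, hop=2, eps=1e-8):
    """Per-frame log energy: a toy stand-in for one temporal acoustic parameter.

    frame_len, hop, and eps are illustrative values, not taken from the paper.
    """
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        mean_sq = sum(x * x for x in frame) / frame_len
        frames.append(math.log(mean_sq + eps))
    return frames


def mean_abs_error(a, b):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def total_loss(enhanced, clean, weight=0.1):
    """Base loss plus a weighted frame-level acoustic term.

    The base term is a waveform-level placeholder for whatever loss the
    enhancement model already uses; the weight is a hypothetical value.
    """
    base = mean_abs_error(enhanced, clean)
    aux = mean_abs_error(frame_log_energy(enhanced), frame_log_energy(clean))
    return base + weight * aux


if __name__ == "__main__":
    clean = [0.0, 0.5, -0.5, 0.25, -0.25, 0.5, -0.5, 0.0]
    enhanced = [x + 0.05 for x in clean]  # pretend model output
    print(total_loss(enhanced, clean))
```

Because the auxiliary term compares parameters frame by frame and does not average them over the utterance first, it penalizes local deviations that an aggregated statistic would hide, which is the distinction the abstract draws against prior work.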