Supervised learning with deep neural networks has recently achieved substantial improvements in speech enhancement. Denoising networks learn a mapping from noisy speech either directly to clean speech or to a spectral mask, i.e., the ratio between the clean and noisy spectra. In either case, the network is optimized by minimizing the mean square error (MSE) between the ground-truth labels and the time-domain or spectral output. However, existing schemes suffer from at least one of two critical issues: spectrum mismatch and metric mismatch. Spectrum mismatch is the well-known problem that a spectrum modified after the short-time Fourier transform (STFT) cannot, in general, be fully recovered after the inverse short-time Fourier transform (ISTFT). Metric mismatch means that the conventional MSE loss is suboptimal for maximizing our target metrics, signal-to-distortion ratio (SDR) and perceptual evaluation of speech quality (PESQ). This paper presents a new end-to-end denoising framework whose goal is the joint optimization of SDR and PESQ. First, the network is optimized on the time-domain signals obtained after the ISTFT, which avoids spectrum mismatch. Second, two loss functions that correlate better with the SDR and PESQ metrics are proposed to reduce metric mismatch. Experimental results show that the proposed denoising scheme significantly improves both SDR and PESQ over existing methods.
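The metric mismatch described above can be illustrated with a minimal sketch. It assumes a common simplified definition of SDR (clean-signal energy divided by error energy, in dB), which is not necessarily the exact definition or loss used in the paper; the test tones, amplitudes, and function names are hypothetical and chosen only for illustration.

```python
import math

def mse(clean, estimate):
    """Mean square error between two equal-length sample sequences."""
    return sum((c - e) ** 2 for c, e in zip(clean, estimate)) / len(clean)

def sdr_db(clean, estimate):
    """Simplified SDR in dB (assumed definition): clean energy over error energy."""
    signal_energy = sum(c * c for c in clean)
    error_energy = sum((c - e) ** 2 for c, e in zip(clean, estimate))
    return 10.0 * math.log10(signal_energy / error_energy)

# Two hypothetical clean signals: a loud and a quiet 440 Hz tone sampled at 16 kHz.
n = 1600
loud = [1.0 * math.sin(2 * math.pi * 440 * t / 16000) for t in range(n)]
quiet = [0.1 * math.sin(2 * math.pi * 440 * t / 16000) for t in range(n)]

# The same residual error is present in both estimates.
residual = [0.01 * math.cos(2 * math.pi * 1000 * t / 16000) for t in range(n)]
loud_est = [c + r for c, r in zip(loud, residual)]
quiet_est = [c + r for c, r in zip(quiet, residual)]

mse_loud, mse_quiet = mse(loud, loud_est), mse(quiet, quiet_est)
sdr_loud, sdr_quiet = sdr_db(loud, loud_est), sdr_db(quiet, quiet_est)

print(f"MSE: loud={mse_loud:.3e}, quiet={mse_quiet:.3e}")
print(f"SDR: loud={sdr_loud:.2f} dB, quiet={sdr_quiet:.2f} dB")
```

Under this assumed definition, both estimates have essentially the same MSE, yet their SDR values differ by about 20 dB because the quiet signal has one hundredth of the energy. An MSE loss therefore treats the two cases as equally good, while an SDR-based objective does not.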