Most speech enhancement (SE) models learn a point estimate and do not exploit uncertainty estimation in the learning process. In this paper, we show that modeling heteroscedastic uncertainty by minimizing a multivariate Gaussian negative log-likelihood (NLL) improves SE performance at no extra cost. During training, our approach augments a model learning complex spectral mapping with a temporary submodel that predicts the covariance of the enhancement error at each time-frequency bin. Because the heteroscedastic uncertainty is unrestricted, the covariance introduces an undersampling effect that is detrimental to SE performance. To mitigate undersampling, our approach inflates the uncertainty lower bound and weights each loss component by its uncertainty, effectively compensating severely undersampled components with larger penalties. Our multivariate setting subsumes common covariance assumptions such as scalar and diagonal matrices. By weakening these assumptions, we show that the NLL achieves superior performance compared to popular loss functions, including the mean squared error (MSE), mean absolute error (MAE), and scale-invariant signal-to-distortion ratio (SI-SDR).
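To make the loss concrete, the following is a minimal sketch of a heteroscedastic Gaussian NLL under the diagonal-covariance assumption, with an inflated variance lower bound as described above. The function name, the `var_floor` parameter, and the use of a predicted log-variance are illustrative assumptions, not the paper's exact implementation (which uses a full multivariate covariance per time-frequency bin).

```python
import numpy as np

def gaussian_nll(error, log_var, var_floor=1e-3):
    """Diagonal-covariance heteroscedastic Gaussian NLL (illustrative sketch).

    error:   enhancement error at each time-frequency bin
    log_var: per-bin log-variance predicted by a temporary submodel
    """
    # Inflate the uncertainty lower bound to mitigate undersampling.
    var = np.maximum(np.exp(log_var), var_floor)
    # Each squared-error term is weighted by its inverse variance, so
    # severely undersampled (low-variance) components incur larger penalties;
    # the log(var) term prevents the trivial solution of unbounded variance.
    return np.mean(error ** 2 / var + np.log(var))
```

With a constant predicted variance of one, the weighting term reduces to the plain MSE, which illustrates how the NLL generalizes standard point-estimate losses.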