Most speech enhancement (SE) models learn a point estimate and do not exploit uncertainty estimation in the learning process. In this paper, we show that modeling heteroscedastic uncertainty by minimizing a multivariate Gaussian negative log-likelihood (NLL) improves SE performance at no extra cost. During training, our approach augments a model learning complex spectral mapping with a temporary submodel that predicts the covariance of the enhancement error at each time-frequency bin. Because the heteroscedastic uncertainty is unrestricted, the covariance introduces an undersampling effect that is detrimental to SE performance. To mitigate undersampling, our approach inflates the uncertainty lower bound and weights each loss component by its uncertainty, effectively compensating severely undersampled components with larger penalties. Our multivariate setting subsumes common covariance assumptions, such as scalar and diagonal matrices, as special cases. By weakening these assumptions, we show that the NLL achieves superior performance compared to popular losses including the mean squared error (MSE), mean absolute error (MAE), and scale-invariant signal-to-distortion ratio (SI-SDR).
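To make the loss construction concrete, the following is a minimal sketch of a heteroscedastic Gaussian NLL under a diagonal-covariance assumption, where a submodel predicts a log-variance per time-frequency bin and the variance is floored at an inflated lower bound. The function name, the log-variance parameterization, and the specific floor value are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def diagonal_gaussian_nll(err_real, err_imag, log_var, var_floor=1e-4):
    """Sketch of a heteroscedastic Gaussian NLL per time-frequency bin.

    err_real, err_imag: real/imaginary parts of the enhancement error
        (clean spectrum minus estimate), arrays of shape (T, F).
    log_var: per-bin log-variance predicted by a temporary submodel,
        shape (T, F). This is a hypothetical parameterization.
    var_floor: inflated lower bound on the uncertainty; raising it
        limits how small the predicted variance can get.
    """
    # Floor the variance: an inflated lower bound keeps the inverse-variance
    # weights bounded, mitigating the undersampling effect.
    var = np.maximum(np.exp(log_var), var_floor)
    sq_err = err_real ** 2 + err_imag ** 2
    # Each bin's squared error is weighted by its inverse variance, so
    # low-uncertainty bins are penalized more; log(var) regularizes the
    # variance so it cannot grow without bound.
    nll = sq_err / var + np.log(var)
    return nll.mean()
```

With `log_var` fixed at zero the variance is one everywhere and the loss reduces to the mean squared error of the complex spectrum, which illustrates how the MSE arises as a special case of this NLL under a scalar covariance assumption.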