Diffusion-based speech enhancement (SE) models need correct prior knowledge as a reliable condition in order to generate accurate predictions, but noisy features alone make it difficult to provide such a condition. One solution is to condition on features enhanced by deterministic methods, although the information distortion and loss these methods introduce might affect the diffusion process. In this paper, we first investigate the effects of conditioning diffusion on the outputs of different deterministic SE models. We evaluate two types of condition, distinguished by whether the noisy feature is included: one using only the deterministic feature (deterministic-only), and the other using both the deterministic and noisy features (deterministic-noisy). Our preliminary investigation finds that deterministically enhanced conditions improve the listening experience on real data, while the choice between deterministic-only and deterministic-noisy conditions depends on the deterministic model. Based on these findings, we propose a dual-streaming encoding Repair-Diffusion Model for SE (DERDM-SE) that uses both conditions more effectively. We also find that fine-grained deterministic models show greater potential on objective evaluation metrics, while UNet-based deterministic models yield more stable diffusion performance. Therefore, in DERDM-SE, we propose a deterministic model that combines coarse- and fine-grained processing. Experimental results on CHiME4 show that the proposed models effectively leverage deterministic models to achieve better SE evaluation scores and more stable performance than other diffusion-based SE models.
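The difference between the deterministic-only and deterministic-noisy conditions can be illustrated with a minimal sketch. It assumes that features are multi-channel time-frequency grids and that the two features are combined by channel-wise stacking; the abstract does not specify how they are combined, and the function name `build_condition` and the toy values are hypothetical.

```python
from typing import List, Tuple

# A feature is a list of channels; each channel is a 2-D grid [freq][time].
Feature = List[List[List[float]]]


def grid_shape(feature: Feature) -> Tuple[int, int]:
    """Return (num_freq_bins, num_frames) of the first channel."""
    return len(feature[0]), len(feature[0][0])


def build_condition(enhanced: Feature, noisy: Feature, mode: str) -> Feature:
    """Assemble the conditioning input for a hypothetical diffusion SE model.

    mode="deterministic-only":  use only the deterministically enhanced feature.
    mode="deterministic-noisy": stack enhanced and noisy features channel-wise.
    Channel-wise stacking is an assumption of this sketch, not the paper's design.
    """
    if mode == "deterministic-only":
        return list(enhanced)
    if mode == "deterministic-noisy":
        # Both features must share the same freq x time grid to be stacked.
        if grid_shape(enhanced) != grid_shape(noisy):
            raise ValueError("enhanced and noisy features must have the same shape")
        return list(enhanced) + list(noisy)
    raise ValueError(f"unknown mode: {mode!r}")


# Toy single-channel features with 2 frequency bins and 3 frames (made-up values).
noisy = [[[0.9, 0.8, 0.7], [0.6, 0.5, 0.4]]]
enhanced = [[[0.5, 0.4, 0.3], [0.2, 0.1, 0.0]]]

cond_only = build_condition(enhanced, noisy, "deterministic-only")
cond_both = build_condition(enhanced, noisy, "deterministic-noisy")
print(len(cond_only), len(cond_both))  # prints: 1 2
```

In this sketch the deterministic-noisy condition has twice as many channels as the deterministic-only one, so a model consuming it would need a correspondingly wider input layer.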