Speech enhancement has seen great improvement in recent years through end-to-end neural networks. However, most models are agnostic to the spoken phonetic content. Recently, several studies have suggested phonetic-aware speech enhancement, mostly via perceptual supervision. Yet, injecting phonetic features during model optimization can take additional forms (e.g., model conditioning). In this paper, we conduct a systematic comparison between different methods of incorporating phonetic information into a speech enhancement model. Through a series of controlled experiments, we observe the influence of different phonetic content models, as well as various feature-injection techniques, on enhancement performance, considering both causal and non-causal models. Specifically, we evaluate three settings for injecting phonetic information, namely: i) feature conditioning; ii) perceptual supervision; and iii) regularization. Phonetic features are obtained from an intermediate layer of either a supervised pre-trained Automatic Speech Recognition (ASR) model or a pre-trained Self-Supervised Learning (SSL) model. We further examine the effect of choosing different embedding layers on performance, considering both manual and learned configurations. Results suggest that using an SSL model to extract phonetic features outperforms the ASR alternative in most cases. Interestingly, the conditioning setting performs best among the evaluated configurations.
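The three injection settings named above can be illustrated with a toy sketch. Everything here is hypothetical: `phonetic_embed` stands in for an intermediate layer of a frozen ASR/SSL model, `enhance` for the enhancement network, and the loss weights are placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def phonetic_embed(x):
    # Stand-in for an intermediate layer of a frozen pre-trained
    # ASR or SSL model (hypothetical linear projection).
    W = np.full((x.shape[-1], 4), 0.1)
    return x @ W

def enhance(noisy, cond=None):
    # Stand-in enhancement model; `cond` carries phonetic features.
    out = noisy * 0.9
    if cond is not None:
        # (i) feature conditioning: phonetic features enter the model
        out = out + 0.01 * cond.mean(axis=-1, keepdims=True)
    return out

noisy = rng.normal(size=(16, 8))
clean = noisy - rng.normal(scale=0.1, size=(16, 8))

enhanced = enhance(noisy, cond=phonetic_embed(noisy))

# (ii) perceptual supervision: compare phonetic embeddings of the
# enhanced and clean signals instead of (or alongside) waveforms.
l_percep = np.mean((phonetic_embed(enhanced) - phonetic_embed(clean)) ** 2)

# (iii) regularization: add a weighted phonetic term to the signal loss.
l_wave = np.mean((enhanced - clean) ** 2)
l_total = l_wave + 0.1 * l_percep
print(l_wave, l_total)
```

In a real system the embedding model would be, e.g., a wav2vec-style SSL network kept frozen, and the gradients of `l_percep` would flow only into the enhancement model.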