Training personalized speech enhancement models is innately a no-shot learning problem due to privacy constraints and limited access to noise-free speech from the target user. If there is an abundance of unlabeled noisy speech from the test-time user, a personalized speech enhancement model can be trained using self-supervised learning. One straightforward approach to model personalization is to use the target speaker's noisy recordings as pseudo-sources. A pseudo denoising model then learns to remove injected training noises and recover the pseudo-sources. However, this approach is volatile because it depends on the quality of the pseudo-sources, which may themselves be too noisy. As a remedy, we propose an improvement to the self-supervised approach through data purification. We first train an SNR predictor model to estimate the frame-by-frame SNR of the pseudo-sources. The predictor's estimates are then converted into weights that adjust the frame-by-frame contribution of the pseudo-sources to the training of the personalized model. We empirically show that the proposed data purification step improves the usability of the speaker-specific noisy data in the context of personalized speech enhancement. Because it does not rely on any clean speech recordings or speaker embeddings, our approach may be seen as privacy-preserving.
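The purification step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how SNR estimates are converted into weights or which loss is used, so the sigmoid mapping, its `threshold_db` and `slope` parameters, the per-frame mean squared error, and the function names `snr_to_weights` and `weighted_frame_loss` are all assumptions made for illustration. The SNR predictor itself is out of scope here; its per-frame estimates are taken as given.

```python
import numpy as np


def snr_to_weights(snr_db, threshold_db=0.0, slope=0.5):
    """Map per-frame SNR estimates (in dB) to weights in (0, 1).

    Assumption: a sigmoid centered at `threshold_db`. The abstract only
    states that SNR estimates are converted into weights.
    """
    snr_db = np.asarray(snr_db, dtype=float)
    # Clip the exponent so extreme SNR values cannot overflow np.exp.
    z = np.clip(slope * (snr_db - threshold_db), -50.0, 50.0)
    return 1.0 / (1.0 + np.exp(-z))


def weighted_frame_loss(estimate, pseudo_source, weights, eps=1e-8):
    """Weighted average of per-frame mean squared errors.

    `estimate` and `pseudo_source` have shape (frames, features);
    `weights` has shape (frames,). Frames of the pseudo-source judged
    noisy (low weight) contribute less to the training loss.
    """
    per_frame_mse = np.mean((estimate - pseudo_source) ** 2, axis=1)
    return float(np.sum(weights * per_frame_mse) / (np.sum(weights) + eps))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames, features = 5, 8
    pseudo_source = rng.standard_normal((frames, features))
    estimate = pseudo_source + 0.1 * rng.standard_normal((frames, features))
    # Hypothetical predictor output: one SNR estimate per frame, in dB.
    predicted_snr_db = np.array([-10.0, 0.0, 5.0, 15.0, 25.0])
    weights = snr_to_weights(predicted_snr_db)
    print("weights:", np.round(weights, 3))
    print("loss:", weighted_frame_loss(estimate, pseudo_source, weights))
```

In this sketch the loss is normalized by the sum of the weights, so an utterance with many low-SNR frames is not penalized simply for having less usable data. A hard threshold or a different monotonic mapping would fit the description in the abstract equally well.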