In realistic speech enhancement settings for end-user devices, we often encounter only a few speakers and noise types that tend to recur in the specific acoustic environment. We propose a novel personalized speech enhancement method that adapts a compact denoising model to this test-time specificity. Our goal in this test-time adaptation is to use no clean speech target of the test speaker, thus fulfilling the requirement for zero-shot learning. To compensate for the missing clean utterances, we employ the knowledge distillation framework: in place of the unavailable clean target, we distill the more advanced denoising results of an overly large teacher model and use them as pseudo targets to train the small student model. This zero-shot learning procedure circumvents the collection of users' clean speech, a process with which users are reluctant to comply due to privacy concerns and the technical difficulty of recording clean voice. Experiments under various test-time conditions show that the proposed personalization method achieves significant performance gains over larger baseline networks trained on large speaker- and noise-agnostic datasets. In addition, since the compact personalized models can outperform larger general-purpose models, we claim that the proposed method performs model compression with no loss of denoising performance.
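Below is a minimal sketch of the zero-shot knowledge-distillation adaptation loop described above, written in PyTorch. It is not the authors' implementation: the toy mask-estimation networks, the MSE distillation loss, the optimizer settings, and the randomly generated "test-time recordings" are placeholder assumptions used only to illustrate how the frozen teacher's denoised output can serve as the pseudo target for adapting the compact student without any clean speech.

```python
import torch
import torch.nn as nn

def make_denoiser(hidden: int) -> nn.Module:
    # Toy fully connected mask estimator operating on 257-bin magnitude
    # frames; stands in for the actual enhancement architectures.
    return nn.Sequential(
        nn.Linear(257, hidden), nn.ReLU(),
        nn.Linear(hidden, 257), nn.Sigmoid(),
    )

teacher = make_denoiser(hidden=1024)   # large, general-purpose model (frozen)
student = make_denoiser(hidden=64)     # compact, personalized model (adapted)
teacher.eval()

optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
mse = nn.MSELoss()

# Noisy test-time recordings from the target speaker and environment,
# faked here with random spectrogram frames; no clean targets are needed.
noisy_frames = torch.rand(512, 257)

for epoch in range(10):
    with torch.no_grad():
        # The teacher's denoising result acts as the pseudo target.
        pseudo_target = teacher(noisy_frames) * noisy_frames
    enhanced = student(noisy_frames) * noisy_frames
    loss = mse(enhanced, pseudo_target)  # distillation loss, no clean speech
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In this sketch the student is fine-tuned only on the user's own noisy data, so it specializes to the few speakers and noise types that recur at test time while staying small enough for on-device deployment.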