Speech enhancement systems can show improved performance by adapting the model towards a single test-time speaker. In this personalization context, the test-time user might only provide a small amount of noise-free speech data, likely insufficient for traditional fully supervised learning. One way to overcome the lack of personal data is to transfer the parameters of a speaker-agnostic model to initialize the personalized model and then fine-tune it on the small amount of personal speech data. This baseline, however, adapts only marginally to the scarce clean speech data. Alternatively, we propose self-supervised methods designed specifically to learn personalized and discriminative features from abundant in-the-wild noisy, yet still personal, speech recordings. Our experiments show that the proposed self-supervised learning methods initialize personalized speech enhancement models better than the fully supervised baselines, yielding superior speech enhancement performance. The proposed methods also produce a feature set that is more robust under real-world constraints: compressed model sizes and scarcity of labeled data.
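To make the two initialization strategies concrete, the sketch below contrasts the fully supervised transfer baseline (copying speaker-agnostic weights, then fine-tuning on scarce clean personal pairs) with a self-supervised pretext step that only needs noisy personal recordings. This is a minimal illustration under assumed names and shapes (MaskEstimator, masked-frame reconstruction, the checkpoint path), not the paper's actual architecture or pretext tasks.

import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    """Toy enhancement network: predicts a magnitude mask per STFT frame (assumed design)."""
    def __init__(self, n_freq=257, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(n_freq, hidden, num_layers=2, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy_mag):              # (batch, frames, n_freq)
        h, _ = self.encoder(noisy_mag)
        return self.mask(h)                    # mask values in [0, 1]

# Baseline: initialize the personalized model from speaker-agnostic weights,
# then fine-tune on the small set of clean personal (noisy, clean) pairs.
generalist = MaskEstimator()
# generalist.load_state_dict(torch.load("speaker_agnostic.pt"))   # hypothetical checkpoint
personalized = MaskEstimator()
personalized.load_state_dict(generalist.state_dict())             # transfer initialization
opt = torch.optim.Adam(personalized.parameters(), lr=1e-4)

def finetune_step(noisy_mag, clean_mag):
    """One supervised step over the scarce clean personal data."""
    est = personalized(noisy_mag) * noisy_mag
    loss = nn.functional.mse_loss(est, clean_mag)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Proposed direction (illustrative pretext only): pretrain the encoder on abundant
# noisy personal recordings with masked-frame reconstruction, requiring no clean targets.
recon_head = nn.Linear(256, 257)
pre_opt = torch.optim.Adam(list(personalized.encoder.parameters()) + list(recon_head.parameters()), lr=1e-4)

def pretext_step(noisy_mag, mask_ratio=0.2):
    """Reconstruct randomly masked frames from context using only noisy speech."""
    frame_mask = (torch.rand(noisy_mag.shape[:2]) < mask_ratio).unsqueeze(-1)
    corrupted = noisy_mag.masked_fill(frame_mask, 0.0)
    h, _ = personalized.encoder(corrupted)
    recon = recon_head(h)
    sel = frame_mask.expand_as(recon)
    loss = nn.functional.mse_loss(recon[sel], noisy_mag[sel])
    pre_opt.zero_grad(); loss.backward(); pre_opt.step()
    return loss.item()

In this sketch, pretext_step would be run first over the in-the-wild noisy personal recordings to warm up the encoder, and finetune_step afterwards on the few labeled pairs; the supervised-only baseline skips the pretext stage entirely.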