This work presents self-supervised learning methods for developing monaural speaker-specific (i.e., personalized) speech enhancement models. While generalist models must serve many speakers broadly, specialist models can adapt their enhancement function to a particular speaker's voice and are expected to solve a narrower problem. Hence, specialists can achieve better performance while also reducing computational complexity. However, naive personalization methods may require clean speech from the target user, which is inconvenient to acquire, e.g., due to subpar recording conditions. To this end, we pose personalization as either a zero-shot task, in which no additional clean speech of the target speaker is used for training, or a few-shot learning task, in which the goal is to minimize the duration of clean speech needed for transfer learning. In this paper, we propose self-supervised learning methods as a solution to both the zero-shot and few-shot personalization tasks. The proposed methods are designed to learn personalized speech features from unlabeled data (i.e., in-the-wild noisy recordings from the target user) without access to the corresponding clean sources. Our experiments investigate three different self-supervised learning mechanisms. The results show that the self-supervised models achieve zero-shot and few-shot personalization with fewer model parameters and less clean training data from the target speaker, meeting both the data-efficiency and model-compression goals.
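
As a concrete illustration of the zero-/few-shot workflow described above, the sketch below pretrains a compact enhancement model on unlabeled noisy recordings of the target speaker using a generic masked-reconstruction pretext, then fine-tunes it on a small amount of clean speech. This is a minimal PyTorch sketch under stated assumptions: the names (`SmallEnhancer`, `pretrain_step`, `finetune_step`), the architecture, and the masked-reconstruction pretext are illustrative placeholders, not the paper's three actual self-supervised mechanisms, which are not specified in this section.

```python
# Illustrative sketch only: architecture and pretext task are assumptions,
# not the paper's proposed self-supervised mechanisms.
import torch
import torch.nn as nn

class SmallEnhancer(nn.Module):
    """Compact speaker-specific enhancer operating on magnitude-spectrogram frames."""
    def __init__(self, n_freq=257, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(n_freq, hidden, num_layers=2, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy):                  # noisy: (batch, frames, n_freq)
        h, _ = self.rnn(noisy)
        return self.mask(h) * noisy            # masking-based enhancement estimate

def pretrain_step(model, noisy_batch, optim, mask_ratio=0.2):
    """Self-supervised pretext: reconstruct randomly masked frames of the
    noisy target-speaker recordings, requiring no clean references."""
    frame_mask = (torch.rand(noisy_batch.shape[:2]) < mask_ratio).unsqueeze(-1)
    corrupted = noisy_batch.masked_fill(frame_mask, 0.0)
    recon = model(corrupted)
    loss = ((recon - noisy_batch) ** 2)[frame_mask.expand_as(noisy_batch)].mean()
    optim.zero_grad(); loss.backward(); optim.step()
    return loss.item()

def finetune_step(model, noisy_batch, clean_batch, optim):
    """Few-shot supervised fine-tuning with a small amount of clean target speech."""
    est = model(noisy_batch)
    loss = nn.functional.mse_loss(est, clean_batch)
    optim.zero_grad(); loss.backward(); optim.step()
    return loss.item()

# Usage sketch with random tensors standing in for STFT magnitude features.
model = SmallEnhancer()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
unlabeled_noisy = torch.rand(8, 100, 257)         # in-the-wild recordings of the user
pretrain_step(model, unlabeled_noisy, optim)      # zero-shot / pretraining phase
few_shot_noisy = torch.rand(2, 100, 257)
few_shot_clean = torch.rand(2, 100, 257)
finetune_step(model, few_shot_noisy, few_shot_clean, optim)  # few-shot phase
```

In this reading, the zero-shot setting stops after pretraining on the user's noisy recordings, while the few-shot setting additionally fine-tunes with whatever limited clean speech is available.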