This work explores how self-supervised learning can be used to discover speaker-specific features that enable personalized speech enhancement models. We specifically address the few-shot learning scenario, in which access to clean recordings of a test-time speaker is limited to a few seconds, but noisy recordings of the speaker are abundant. We develop a simple contrastive learning procedure that treats the abundant noisy data as makeshift training targets through pairwise noise injection: the model is pretrained to maximize agreement between pairs of differently deformed identical utterances and to minimize agreement between pairs of similarly deformed nonidentical utterances. Our experiments compare the proposed pretraining approach with two baseline alternatives: speaker-agnostic fully-supervised pretraining, and speaker-specific self-supervised pretraining without contrastive loss terms. Of the three approaches, the proposed method using contrastive mixtures proves most robust to model compression (using 85% fewer parameters) and to reduced clean speech (requiring only 3 seconds).
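The contrastive procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the NT-Xent-style loss, the fixed-SNR noise mixing, and the random linear `encode` placeholder are all assumptions standing in for the paper's actual model and training setup. Two differently noised views of the same utterance form a positive pair; all other pairs in the batch act as negatives.

```python
import numpy as np

rng = np.random.default_rng(0)

def inject_noise(utterances, noise, snr_db=5.0):
    """Mix a noise segment into each utterance at a fixed signal-to-noise ratio."""
    sig_pow = np.mean(utterances**2, axis=1, keepdims=True)
    noi_pow = np.mean(noise**2, axis=1, keepdims=True)
    scale = np.sqrt(sig_pow / (noi_pow * 10 ** (snr_db / 10)))
    return utterances + scale * noise

def nt_xent_loss(z1, z2, temperature=0.5):
    """Contrastive loss: view pairs (z1[i], z2[i]) of the same utterance are
    positives; every other pairing in the batch is a negative."""
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine-similarity space
    sim = (z @ z.T) / temperature
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity
    n = len(z1)
    targets = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    logprob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -logprob[np.arange(2 * n), targets].mean()

# Toy batch: 4 "utterances" of 1600 samples; two differently noised views each.
utts = rng.standard_normal((4, 1600))
view_a = inject_noise(utts, rng.standard_normal((4, 1600)))
view_b = inject_noise(utts, rng.standard_normal((4, 1600)))

# Hypothetical stand-in for a learned speaker encoder.
proj = rng.standard_normal((1600, 64))
encode = lambda x: x @ proj
loss = nt_xent_loss(encode(view_a), encode(view_b))
```

Minimizing this loss pulls embeddings of the same utterance under different deformations together while pushing apart similarly deformed but nonidentical utterances, which is the agreement structure the abstract describes.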