Personalized speech enhancement (PSE) models achieve promising results compared with unconditional speech enhancement models due to their ability to remove interfering speech in addition to background noise. Unlike unconditional speech enhancement, however, causal PSE models may occasionally remove the target speech by mistake. PSE models also tend to leak interfering speech when the target speaker is silent for an extended period. We show that existing PSE methods suffer from a trade-off between speech over-suppression and interference leakage, addressing one problem at the expense of the other. We propose a new PSE model training framework using cross-task knowledge distillation to mitigate this trade-off. Specifically, we utilize a personalized voice activity detector (pVAD) during training to exclude, via hard or soft classification, the non-target speech frames that are wrongly identified as containing the target speaker. This prevents the PSE model from being too aggressive while still allowing the model to learn to suppress the input speech when it is likely to be spoken by interfering speakers. Comprehensive evaluation results are presented, covering various PSE usage scenarios.
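The pVAD-guided frame exclusion described above can be sketched as a loss-weighting step. This is a minimal illustration, not the paper's actual implementation: the function name, the per-frame loss input, and the threshold value are assumptions, and the hard/soft modes correspond to thresholding versus directly reusing the pVAD posterior as a soft weight.

```python
import numpy as np

def pvad_weighted_loss(per_frame_loss, pvad_posterior, mode="soft", threshold=0.5):
    """Down-weight enhancement-loss frames that the pVAD deems non-target.

    per_frame_loss: shape (T,), e.g., per-frame spectral MSE of the PSE model.
    pvad_posterior: shape (T,), pVAD probability that each frame holds target speech.
    mode: "hard" keeps only frames above `threshold`; "soft" uses the
          posterior itself as a continuous frame weight (hypothetical choices).
    """
    per_frame_loss = np.asarray(per_frame_loss, dtype=np.float64)
    pvad_posterior = np.asarray(pvad_posterior, dtype=np.float64)
    if mode == "hard":
        # Hard classification: binary mask from thresholded pVAD posteriors.
        weights = (pvad_posterior >= threshold).astype(np.float64)
    else:
        # Soft classification: posterior acts as a confidence weight per frame.
        weights = pvad_posterior
    # Normalize so the loss scale is comparable across utterances.
    return float(np.sum(weights * per_frame_loss) / (np.sum(weights) + 1e-8))
```

Frames the pVAD confidently labels as non-target contribute little or nothing to the loss, so the PSE model is not penalized for suppressing them, while confidently-target frames keep their full weight.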