Deep-learning-based speech denoising still struggles to improve the perceptual quality of enhanced signals. We introduce a generalized framework, Perceptual Ensemble Regularization Loss (PERL), built on the idea of perceptual losses. A perceptual loss discourages distortion to certain speech properties, and we analyze it using six large-scale pre-trained models: acoustic event classification, acoustic model, speaker embedding, emotion classification, and two self-supervised speech encoders (PASE+ and wav2vec 2.0). We first build a strong baseline (without PERL) using Conformer Transformer networks on the popular enhancement benchmark VCTK-DEMAND. Applying the auxiliary models one at a time, we find the acoustic event model and the self-supervised model PASE+ to be most effective. Our best model (PERL-AE) uses only the acoustic event model (trained on AudioSet) and outperforms state-of-the-art methods on major perceptual metrics. To explore whether denoising can leverage the full framework, we use all networks, but find that our seven-loss formulation suffers from the challenges of multi-task learning. Finally, we report a critical observation: state-of-the-art multi-task weight learning methods cannot outperform hand tuning, perhaps due to domain mismatch and weak complementarity of the losses.
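The core mechanism described above can be sketched in code. The following is a minimal, hedged illustration (not the paper's implementation): a PERL-style objective as a hand-weighted sum of a time-domain reconstruction term and feature-matching terms computed by frozen auxiliary networks. The tiny convolutional encoders here are placeholders standing in for the six large pre-trained models; the class name, weights, and loss choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PerceptualEnsembleLoss(nn.Module):
    """Sketch of a perceptual-ensemble loss: L1 reconstruction plus
    weighted feature-matching terms from frozen auxiliary models
    (placeholders for the pre-trained networks in the paper)."""

    def __init__(self, aux_models, weights, recon_weight=1.0):
        super().__init__()
        self.aux_models = nn.ModuleList(aux_models)
        # Auxiliary models are frozen: they only provide features,
        # they are never updated by the denoiser's optimizer.
        for m in self.aux_models:
            for p in m.parameters():
                p.requires_grad_(False)
        self.weights = weights          # hand-tuned per-loss weights
        self.recon_weight = recon_weight
        self.l1 = nn.L1Loss()

    def forward(self, enhanced, clean):
        # Time-domain reconstruction term.
        loss = self.recon_weight * self.l1(enhanced, clean)
        # Feature-matching terms: distance between auxiliary-model
        # activations on enhanced vs. clean speech.
        for w, model in zip(self.weights, self.aux_models):
            loss = loss + w * self.l1(model(enhanced), model(clean))
        return loss

# Toy frozen encoders standing in for the pre-trained models.
aux = [nn.Sequential(nn.Conv1d(1, 8, 9, padding=4), nn.ReLU())
       for _ in range(2)]
criterion = PerceptualEnsembleLoss(aux, weights=[0.5, 0.5])

clean = torch.randn(4, 1, 16000)                 # batch of waveforms
enhanced = clean + 0.1 * torch.randn_like(clean)  # imperfect output
loss = criterion(enhanced, clean)                # scalar training loss
```

With all six auxiliary models plus a reconstruction term, the sum above has seven weighted losses, which is exactly where the multi-task weighting difficulty reported in the abstract arises: the relative weights must balance losses of very different scales and tasks.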