Self-supervised speech pre-training enables deep neural network models to capture meaningful and disentangled factors from raw waveform signals. The learned universal speech representations can then be used across numerous downstream tasks. These representations, however, are sensitive to distribution shifts caused by environmental factors, such as noise and/or room reverberation. Their large sizes, in turn, make them infeasible for edge applications. In this work, we propose a knowledge distillation methodology termed RobustDistiller, which compresses universal representations while making them more robust against environmental artifacts via a multi-task learning objective. The proposed layer-wise distillation recipe is evaluated on top of three well-established universal representations, as well as on three downstream tasks. Experimental results show that the proposed methodology, applied on top of the WavLM Base+ teacher model, outperforms all other benchmarks across noise types and levels, as well as reverberation times. Oftentimes, the student model (24M parameters) achieves results in line with those of the teacher model (95M parameters).
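To make the recipe concrete, below is a minimal sketch (in PyTorch, not the authors' implementation) of how a layer-wise distillation loss can be combined with a denoising multi-task term. The `teacher`, `student`, `enhance_head`, and `layer_map` names, as well as the exact loss terms, are assumptions for illustration only.

```python
# Minimal sketch of layer-wise distillation with a denoising multi-task term.
# Assumes hypothetical `teacher`/`student` modules returning per-layer hidden
# states and a hypothetical `enhance_head` predicting clean spectral targets.
import torch
import torch.nn.functional as F

def robust_distill_loss(teacher, student, enhance_head, clean_wav, noisy_wav,
                        layer_map, clean_spec, alpha=1.0):
    """The student sees the distorted waveform but is trained to match the
    teacher's representations computed on the clean waveform."""
    with torch.no_grad():
        t_layers = teacher(clean_wav)         # list of [B, T, D] hidden states
    s_layers, s_last = student(noisy_wav)     # student layers + final features

    # Distillation term: L1 + cosine distance between paired student/teacher
    # layers (layer_map gives (student_idx, teacher_idx) pairs).
    distill = 0.0
    for s_idx, t_idx in layer_map:
        s_h, t_h = s_layers[s_idx], t_layers[t_idx]
        distill = distill + F.l1_loss(s_h, t_h) \
                  + (1.0 - F.cosine_similarity(s_h, t_h, dim=-1)).mean()

    # Multi-task term: reconstruct a clean spectral target from the student's
    # final features, encouraging noise- and reverberation-invariant features.
    recon = F.l1_loss(enhance_head(s_last), clean_spec)
    return distill + alpha * recon
```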