With the development of deep learning, neural network-based speech enhancement (SE) models have achieved excellent performance. Meanwhile, self-supervised pre-trained models have been shown to transfer well to a variety of downstream tasks. In this paper, we study the application of a pre-trained model to the real-time SE problem. Specifically, the encoder and bottleneck layer of the DEMUCS model are initialized with the self-supervised pre-trained WavLM model, the convolutions in the encoder are replaced by causal convolutions, and the transformer encoder in the bottleneck layer uses a causal attention mask. In addition, since discretizing the noisy speech representations is beneficial for denoising, we employ a quantization module to discretize the representations output by the bottleneck layer, which are then fed into the decoder to reconstruct the clean speech waveform. Experimental results on the Valentini dataset and an internal dataset show that initialization from the pre-trained model improves SE performance, and that the discretization operation suppresses the noise component in the representations to some extent, further improving performance.
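To make the described modifications concrete, the sketch below illustrates, under our own simplifying assumptions, the three causal/discretization ingredients mentioned above: a left-padded causal 1-D convolution, a causal attention mask for the transformer bottleneck, and a nearest-neighbour codebook lookup standing in for a quantization module. The padding rule, layer sizes, and `SimpleQuantizer` are illustrative choices, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalConv1d(nn.Module):
    """1-D convolution that sees only past samples (left padding only).
    Padding of kernel_size - stride is an illustrative choice, not the paper's exact scheme."""
    def __init__(self, in_ch, out_ch, kernel_size, stride=1):
        super().__init__()
        self.pad = kernel_size - stride
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, stride=stride)

    def forward(self, x):                      # x: (batch, channels, time)
        x = F.pad(x, (self.pad, 0))            # pad on the left only
        return self.conv(x)


def causal_attention_mask(seq_len):
    """Mask so each frame attends only to itself and earlier frames."""
    return torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)


class SimpleQuantizer(nn.Module):
    """Hypothetical nearest-neighbour codebook lookup used here to stand in
    for the quantization module that discretizes the bottleneck output."""
    def __init__(self, num_codes, dim):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, dim))

    def forward(self, z):                      # z: (batch, time, dim)
        book = self.codebook.unsqueeze(0).expand(z.size(0), -1, -1)
        idx = torch.cdist(z, book).argmin(dim=-1)   # nearest code per frame
        return self.codebook[idx]              # discretized representation


# Minimal usage: causal transformer bottleneck followed by quantization.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
bottleneck = nn.TransformerEncoder(layer, num_layers=2)
quantizer = SimpleQuantizer(num_codes=320, dim=512)

frames = torch.randn(1, 100, 512)              # (batch, time, feature) encoder output
hidden = bottleneck(frames, mask=causal_attention_mask(100))
discrete = quantizer(hidden)                   # fed to the decoder in the full model
```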