With the advances in deep learning, neural network based speech enhancement (SE) has developed rapidly in the last decade. Meanwhile, self-supervised pre-trained models and vector quantization (VQ) have achieved excellent performance on many speech-related tasks, yet they remain less explored for SE. As our previous work showed that using a VQ module to discretize noisy speech representations is beneficial for speech denoising, in this work we study the impact of applying VQ at different layers and with different numbers of codebooks. The different VQ modules indeed enable the extraction of speech features at multiple granularities. Via an attention mechanism, the contextual features extracted by a pre-trained model are fused with the local features extracted by the encoder, such that both global and local information are preserved for reconstructing the enhanced speech. Experimental results on the Valentini dataset show that the proposed model improves SE performance, and the impact of the choice of pre-trained model is also revealed.
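As a minimal sketch of the two ingredients described above, the following PyTorch code illustrates (i) a VQ layer that discretizes intermediate speech representations via nearest-codebook lookup with a straight-through estimator, and (ii) a cross-attention block that fuses contextual features from a pre-trained model with local encoder features. All module names, dimensions, and the 320-entry codebook size are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through estimator."""

    def __init__(self, num_codes: int = 320, dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z):                      # z: (batch, time, dim)
        flat = z.reshape(-1, z.size(-1))       # (batch*time, dim)
        # Squared L2 distance from each frame to every codeword.
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))
        idx = dist.argmin(dim=1)
        q = self.codebook(idx).view_as(z)
        # Straight-through: gradients bypass the discrete lookup.
        q_st = z + (q - z).detach()
        commit_loss = F.mse_loss(q.detach(), z) + 0.25 * F.mse_loss(q, z.detach())
        return q_st, commit_loss


class FusionBlock(nn.Module):
    """Cross-attention: local encoder features attend to contextual features."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, local_feats, context_feats):
        fused, _ = self.attn(query=local_feats,
                             key=context_feats,
                             value=context_feats)
        # Residual connection keeps the local (fine-grained) information.
        return self.norm(local_feats + fused)


if __name__ == "__main__":
    B, T, D = 2, 100, 256
    local = torch.randn(B, T, D)    # local features from the noisy-speech encoder
    context = torch.randn(B, T, D)  # contextual features from a pre-trained model
    vq = VectorQuantizer(num_codes=320, dim=D)
    fuse = FusionBlock(dim=D)
    quantized, vq_loss = vq(local)
    out = fuse(quantized, context)
    print(out.shape, vq_loss.item())
```

In practice, several such VQ layers with different codebook sizes could be placed at different encoder depths to obtain multiple-granularity features, which is the design choice the abstract studies.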