In real-world scenarios, speech signals are inevitably corrupted by various types of interference, making speech enhancement (SE) a critical task for robust speech processing. However, most existing SE methods handle only a limited range of distortions, such as additive noise, reverberation, or band limitation, and SE under multiple simultaneous distortions remains understudied. This limits the generalization and practical usability of SE methods in real-world environments. To address this gap, this paper proposes a novel Universal Discrete-domain SE model called UDSE. Unlike regression-based SE models that directly predict the clean speech waveform or continuous features, UDSE redefines SE as a discrete-domain classification task: it predicts the clean discrete tokens produced by the residual vector quantizer (RVQ) of a pre-trained neural speech codec. Specifically, UDSE first extracts global features from the degraded speech. Guided by these global features, clean-token prediction for each vector quantizer (VQ) follows the structure of RVQ, with the prediction for each VQ depending on the results of the preceding ones. Finally, the predicted clean tokens from all VQs are decoded to reconstruct the clean speech waveform. During training, UDSE employs a teacher-forcing strategy and is optimized with a cross-entropy loss. Experimental results confirm that UDSE can effectively enhance speech degraded by various conventional and unconventional distortions, e.g., additive noise, reverberation, band limitation, clipping, phase distortion, and compression distortion, as well as their combinations. These results demonstrate the superior universality and practicality of UDSE compared to advanced regression-based SE methods.
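The stage-by-stage token prediction described in the abstract can be illustrated with a toy, dependency-free sketch for a single frame. The abstract does not specify the network architecture, so everything concrete here is an assumption made for illustration: the sizes `NUM_VQ`, `CODEBOOK`, and `DIM`, the random stand-in codebooks, the one-linear-classifier-per-VQ design, the choice to condition on earlier stages by summing their codebook embeddings into the global feature, and the averaging of the loss over VQs. It only shows the difference between teacher-forced training (conditioning on ground-truth clean tokens) and inference (conditioning on the model's own predictions).

```python
import math
import random

random.seed(0)

NUM_VQ = 4      # number of residual quantizers (hypothetical value)
CODEBOOK = 8    # entries per codebook (hypothetical value)
DIM = 6         # feature dimension (hypothetical value)


def rand_matrix(rows, cols):
    """Random matrix used as a stand-in for learned or pre-trained weights."""
    return [[random.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


# Stand-ins for the frozen codec codebooks and for one linear classifier per VQ.
codebooks = [rand_matrix(CODEBOOK, DIM) for _ in range(NUM_VQ)]
classifiers = [rand_matrix(CODEBOOK, DIM) for _ in range(NUM_VQ)]


def logits_for_stage(stage, global_feat, prev_tokens):
    """Score every codebook entry of `stage` from global features plus earlier tokens."""
    context = list(global_feat)
    # Assumed conditioning: add the embeddings of the tokens chosen at earlier stages.
    for s, tok in enumerate(prev_tokens):
        context = [c + e for c, e in zip(context, codebooks[s][tok])]
    return [sum(w * c for w, c in zip(row, context)) for row in classifiers[stage]]


def cross_entropy(logits, target):
    """Negative log-softmax probability of the target class."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def teacher_forced_loss(global_feat, clean_tokens):
    """Training: stage i is conditioned on the ground-truth clean tokens of stages < i."""
    losses = [
        cross_entropy(logits_for_stage(i, global_feat, clean_tokens[:i]), clean_tokens[i])
        for i in range(NUM_VQ)
    ]
    return sum(losses) / NUM_VQ


def predict_tokens(global_feat):
    """Inference: stage i is conditioned on the model's own predictions for stages < i."""
    tokens = []
    for i in range(NUM_VQ):
        logits = logits_for_stage(i, global_feat, tokens)
        tokens.append(max(range(CODEBOOK), key=lambda k: logits[k]))
    return tokens


# One frame of (random) global features and its (random) clean codec tokens.
global_feat = [random.gauss(0.0, 1.0) for _ in range(DIM)]
clean_tokens = [random.randrange(CODEBOOK) for _ in range(NUM_VQ)]

loss = teacher_forced_loss(global_feat, clean_tokens)
predicted = predict_tokens(global_feat)
print("teacher-forced loss:", loss)
print("predicted tokens:", predicted)
```

In a real system the predicted tokens would be passed to the codec decoder to reconstruct the waveform; that step is omitted here because it depends on the specific pre-trained codec.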