The development of neural audio codecs (NACs) has greatly advanced the application of language models (LMs) to speech processing and understanding. However, the effectiveness of autoregressive (AR) LM-based models in unifying different sub-tasks of speech enhancement (SE) remains unverified. In this work, we propose UniSE, a unified decoder-only LM-based framework that handles multiple SE tasks, including speech restoration, target speaker extraction, and speech separation. It takes input speech features as conditions and generates discrete tokens of the target speech via AR modeling, which reconciles the distinct learning patterns of the different tasks. Experiments on several benchmarks show that UniSE achieves competitive performance against both discriminative and generative baselines, demonstrating the capacity of LMs to unify SE tasks. The demo page is available here: https://github.com/hyyan2k/UniSE.
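The conditional AR generation described above can be sketched as follows. This is a minimal toy illustration, not the actual UniSE implementation: `toy_logits` is a hypothetical stand-in for the decoder-only LM, and the vocabulary size, `EOS` convention, and greedy decoding are all assumptions made for the example.

```python
import numpy as np

VOCAB = 8          # toy codec vocabulary (real NAC codebooks are far larger)
EOS = VOCAB - 1    # assumed end-of-sequence token convention

def toy_logits(cond, tokens):
    """Hypothetical stand-in for the LM: maps the conditioning speech
    features plus previously generated tokens to next-token logits."""
    score = int(cond.sum() + sum(tokens)) % VOCAB
    return np.roll(np.eye(VOCAB)[0], score)  # one-hot "logits" at index `score`

def generate(cond, max_len=10):
    """Greedy autoregressive decoding of discrete target-speech tokens,
    conditioned on input speech features `cond`."""
    tokens = []
    for _ in range(max_len):
        nxt = int(np.argmax(toy_logits(cond, tokens)))
        if nxt == EOS:          # stop when the model emits EOS
            break
        tokens.append(nxt)
    return tokens

out = generate(np.ones(4))      # dummy 4-dim "speech feature" condition
```

In a real system, `cond` would be continuous features extracted from the degraded or mixed input speech, and the generated tokens would be decoded back to a waveform by the NAC decoder.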