We introduce AnyEnhance, a unified generative model for voice enhancement that handles both speech and singing voices. Built on a masked generative model, AnyEnhance supports a wide range of enhancement tasks, including denoising, dereverberation, declipping, super-resolution, and target speaker extraction, simultaneously and without task-specific fine-tuning. AnyEnhance introduces a prompt-guidance mechanism for in-context learning that lets the model natively accept a reference speaker's timbre; this boosts enhancement performance when reference audio is available and enables target speaker extraction without altering the underlying architecture. We also introduce a self-critic mechanism into the generative process of masked generative models, yielding higher-quality outputs through iterative self-assessment and refinement. Extensive experiments across enhancement tasks demonstrate that AnyEnhance outperforms existing methods on both objective metrics and subjective listening tests. Demo audios are publicly available at https://amphionspace.github.io/anyenhance, and an open-source implementation is provided at https://github.com/viewfinder-annn/anyenhance-v1-ccf-aatc.
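To make the abstract's decoding procedure concrete, below is a minimal sketch of MaskGIT-style iterative unmasking with a critic-driven re-masking step, in the spirit of the self-critic mechanism and prompt guidance described above. It is an illustration under stated assumptions, not the paper's actual implementation (see the linked repository for that): the `model` and `critic` interfaces, `MASK_ID`, and the cosine re-masking schedule are all hypothetical choices.

```python
# Minimal sketch of masked-generative enhancement with a self-critic
# re-masking loop. All interfaces here are illustrative assumptions.
import math
import torch

MASK_ID = 0  # hypothetical id of the special [MASK] token


def enhance(model, critic, noisy_tokens, prompt_tokens=None, steps=8):
    """Iteratively unmask clean acoustic tokens conditioned on the noisy input.

    Assumed (hypothetical) signatures:
      model(x, cond)  -> logits over the token vocabulary, shape (B, T, V)
      critic(x, cond) -> per-token quality scores in [0, 1], shape (B, T)
    """
    B, T = noisy_tokens.shape
    # Prompt guidance: prepend reference-speaker tokens to the condition
    # so the model can read the target timbre in context.
    cond = noisy_tokens if prompt_tokens is None else torch.cat(
        [prompt_tokens, noisy_tokens], dim=1)

    # Start fully masked and fill in over `steps` iterations.
    tokens = torch.full((B, T), MASK_ID, dtype=torch.long)
    for step in range(steps):
        logits = model(tokens, cond)              # (B, T, V)
        sampled = logits.argmax(dim=-1)           # greedy fill-in for brevity
        tokens = torch.where(tokens == MASK_ID, sampled, tokens)

        if step == steps - 1:
            break
        # Self-critic: score the current draft, then re-mask the tokens the
        # critic trusts least. The masked fraction shrinks over steps
        # (cosine schedule, a common MaskGIT-style choice).
        scores = critic(tokens, cond)             # (B, T)
        mask_ratio = math.cos(math.pi / 2 * (step + 1) / steps)
        n_remask = max(1, int(T * mask_ratio))
        worst = scores.topk(n_remask, dim=-1, largest=False).indices
        tokens.scatter_(1, worst, MASK_ID)        # refine these next round
    return tokens
```

The design point the sketch tries to convey is that the critic replaces the usual confidence-based re-masking: instead of trusting the generator's own probabilities, a separate assessment of the draft decides which tokens survive each round.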