Neural Audio Codecs for Prompt-Driven Universal Sound Separation

Text-guided sound separation supports flexible audio editing across media and assistive applications, but existing models like AudioSep are too compute-heavy for edge deployment. Neural audio codec (NAC) models such as CodecFormer and SDCodec are compute-efficient but limited to fixed-class separation. We introduce CodecSep, the first NAC-based model for on-device universal, text-driven separation. CodecSep combines DAC compression with a Transformer masker modulated by CLAP-derived FiLM parameters. Across six open-domain benchmarks under matched training/prompt protocols, CodecSep surpasses AudioSep in separation fidelity (SI-SDR) while remaining competitive in perceptual quality (ViSQOL) and matching or exceeding fixed-stem baselines (TDANet, CodecFormer, SDCodec). In code-stream deployments, it needs just 1.35 GMACs end-to-end, approximately $54\times$ less compute ($25\times$ architecture-only) than spectrogram-domain separators like AudioSep, while remaining fully bitstream-compatible.
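To make the FiLM conditioning mentioned above concrete, here is a minimal sketch of feature-wise linear modulation: a prompt embedding is projected to a per-channel scale (gamma) and shift (beta), which are then applied to every latent frame. This is not the CodecSep implementation. The function names, the toy dimensions, and the weights are hypothetical, the `text_emb` vector merely stands in for a CLAP text embedding, and the abstract does not say where inside the Transformer masker the modulation is applied.

```python
def linear(vec, weight, bias):
    """Dense layer: `weight` holds one row of input weights per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def film_params(text_emb, w_gamma, b_gamma, w_beta, b_beta):
    """Project a prompt embedding to per-channel scale (gamma) and shift (beta)."""
    return linear(text_emb, w_gamma, b_gamma), linear(text_emb, w_beta, b_beta)


def film(frames, gamma, beta):
    """Apply gamma * x + beta channel-wise to every frame (a list of C floats)."""
    return [[g * x + b for x, g, b in zip(frame, gamma, beta)]
            for frame in frames]


# Toy sizes (assumed for illustration): 2-dim prompt embedding,
# 3 latent channels, 2 time frames.
text_emb = [1.0, -1.0]                      # stand-in for a CLAP text embedding
w_gamma = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b_gamma = [1.0, 1.0, 1.0]                   # bias of 1 keeps gamma near identity
w_beta = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
b_beta = [0.0, 0.0, 0.0]

gamma, beta = film_params(text_emb, w_gamma, b_gamma, w_beta, b_beta)
frames = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0]]  # stand-in for codec latents
modulated = film(frames, gamma, beta)
```

Because the same `gamma` and `beta` are reused for every frame, the prompt changes how channels are weighted without adding per-frame compute beyond one multiply-add per value, which is one reason FiLM-style conditioning is a common choice when compute is constrained.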
