Single-channel audio separation aims to recover individual sources from a single-channel mixture. Most existing methods rely on supervised learning with synthetically generated paired data, yet obtaining high-quality paired data in real-world scenarios is often difficult. This data scarcity can degrade performance under unseen conditions and limit generalization. In this work, we instead approach the problem from an unsupervised perspective, framing it as a probabilistic inverse problem. Our method requires only diffusion priors trained on the individual sources; separation is then achieved by iteratively guiding an initial state toward the solution through reconstruction guidance. Importantly, we introduce an inverse-problem solver designed specifically for separation, which mitigates the gradient conflicts that arise when the diffusion prior and the reconstruction guidance interfere during inverse denoising. This design ensures high-quality and balanced separation across the individual sources. Additionally, we find that initializing the denoising process with an augmented mixture rather than pure Gaussian noise provides an informative starting point that markedly improves final performance. To further strengthen audio prior modeling, we design a novel time-frequency attention-based network architecture with strong audio modeling capability. Together, these improvements yield significant performance gains, validated on speech-sound event, sound event, and speech separation tasks.
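The core sampling idea above — combining per-source diffusion priors with a reconstruction-guidance gradient that pulls the summed estimates toward the observed mixture, starting from a mixture-based initialization rather than pure noise — can be sketched in a toy form. This is a minimal illustration under simplifying assumptions, not the paper's actual solver: `toy_score` stands in for trained score networks, the update is a crude probability-flow-style Euler step for a variance-exploding schedule, and all names and constants here are hypothetical.

```python
import numpy as np

def toy_score(x, sigma):
    # Placeholder for a trained per-source diffusion prior (hypothetical):
    # the score of a zero-mean unit Gaussian convolved with noise level sigma.
    return -x / (1.0 + sigma**2)

def separate(mixture, n_steps=50, sigma_max=1.0, sigma_min=0.01,
             guidance=3.0, seed=0):
    """Toy reconstruction-guided sampling for two-source separation.

    Each denoising step combines the prior score of each source estimate
    with the gradient of -||mixture - (x1 + x2)||^2 / 2, which pushes the
    sum of the estimates toward the observed mixture.
    """
    rng = np.random.default_rng(seed)
    sigmas = np.geomspace(sigma_max, sigma_min, n_steps)
    # Informative initialization: start from the mixture plus noise,
    # not from pure Gaussian noise.
    x1 = mixture / 2 + sigmas[0] * rng.standard_normal(mixture.shape)
    x2 = mixture / 2 + sigmas[0] * rng.standard_normal(mixture.shape)
    for i in range(n_steps - 1):
        s, s_next = sigmas[i], sigmas[i + 1]
        residual = mixture - (x1 + x2)
        # Prior score plus reconstruction-guidance gradient per source.
        g1 = toy_score(x1, s) + guidance * residual
        g2 = toy_score(x2, s) + guidance * residual
        dt = s_next - s  # negative: noise level decreases
        # Euler step of the VE probability-flow ODE dx/dsigma = -sigma * score.
        x1 = x1 - dt * s * g1
        x2 = x2 - dt * s * g2
    return x1, x2

mixture = np.array([1.0, -0.5, 2.0])
x1, x2 = separate(mixture)
```

In this toy the two priors are identical, so the estimates stay near-symmetric; in the actual method, distinct priors trained on different source types break that symmetry, and the proposed solver additionally resolves conflicts between the prior and guidance gradients rather than naively summing them as done here.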