Source separation is a fundamental task in speech, music, and audio processing, and it also supplies cleaner, larger-scale data for training generative models. However, improving separation performance in practice often relies on increasingly large networks, inflating training and deployment costs. Motivated by recent advances in inference-time scaling for generative modeling, we propose Training-Time and Inference-Time Scalable Discriminative Source Separation (TISDiSS), a unified framework that integrates early-split multi-loss supervision, a shared-parameter design, and dynamic inference repetitions. TISDiSS enables flexible speed-performance trade-offs by adjusting inference depth, without retraining additional models. We further provide systematic analyses of architectural and training choices and show that training with more inference repetitions improves shallow-inference performance, benefiting low-latency applications. Experiments on standard speech separation benchmarks demonstrate state-of-the-art performance with a reduced parameter count, establishing TISDiSS as a scalable and practical framework for adaptive source separation. Code is available at https://github.com/WingSingFung/TISDiSS.
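To make the core mechanism concrete, the following is a minimal, hypothetical sketch of the idea the abstract describes: a separator block with shared parameters is applied a variable number of times, each pass emits an intermediate ("early-split") estimate that receives its own loss during training, and at inference the repetition count is chosen freely to trade speed against quality. All module names, layer choices, and shapes here are illustrative assumptions, not the authors' actual TISDiSS architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedIterativeSeparator(nn.Module):
    """Sketch: one shared refinement block reused across inference repetitions.

    Hypothetical stand-in modules; the real TISDiSS separator differs.
    """

    def __init__(self, dim: int = 64, num_sources: int = 2):
        super().__init__()
        self.encoder = nn.Linear(1, dim)                  # stand-in encoder
        self.block = nn.GRU(dim, dim, batch_first=True)   # shared-parameter block
        self.decoder = nn.Linear(dim, num_sources)        # early-split output head

    def forward(self, mixture: torch.Tensor, repetitions: int = 4):
        # mixture: (batch, time). Returns one estimate per repetition so a
        # loss can supervise every depth (early-split multi-loss training).
        h = self.encoder(mixture.unsqueeze(-1))
        estimates = []
        for _ in range(repetitions):                      # same weights each pass
            h, _ = self.block(h)
            estimates.append(self.decoder(h))
        return estimates


model = SharedIterativeSeparator()
mix = torch.randn(8, 16000)                               # toy mixture batch
targets = torch.randn(8, 16000, 2)                        # placeholder references

# Train deep (e.g., 8 repetitions) with a loss on every intermediate estimate ...
outs = model(mix, repetitions=8)
loss = sum(F.mse_loss(est, targets) for est in outs) / len(outs)

# ... then deploy shallow for low latency, with no retraining needed.
fast_out = model(mix, repetitions=2)[-1]
```

Because every repetition reuses the same parameters and is supervised during training, truncating the loop at inference yields a valid (if weaker) estimate, which is what enables the speed-performance trade-off without training additional models.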