Supervised neural network training has led to significant progress on single-channel sound separation. This approach relies on ground-truth isolated sources, which precludes scaling to widely available mixture data and limits progress on open-domain tasks. The recent mixture invariant training (MixIT) method enables training on in-the-wild data; however, it suffers from two outstanding problems. First, it produces models that tend to over-separate, producing more output sources than are present in the input. Second, the exponential computational complexity of the MixIT loss limits the number of feasible output sources. These problems interact: increasing the number of output sources exacerbates over-separation. In this paper we address both issues. To combat over-separation we introduce new losses: sparsity losses that favor fewer output sources and a covariance loss that discourages correlated outputs. We also experiment with a semantic classification loss by predicting weak class labels for each mixture. To extend MixIT to larger numbers of sources, we introduce an efficient approximation using a fast least-squares solution, projected onto the MixIT constraint set. Our experiments show that the proposed losses curtail over-separation and improve overall performance. The best performance is achieved using larger numbers of output sources, enabled by our efficient MixIT loss, combined with sparsity losses to prevent over-separation. On the FUSS test set, we achieve over 13 dB in multi-source SI-SNR improvement, while boosting single-source reconstruction SI-SNR by over 17 dB.
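As a rough illustration of the efficient MixIT approximation mentioned above (a minimal NumPy sketch under our own assumptions, not the authors' released implementation), one can solve an unconstrained least-squares problem for the mixing matrix and then project it onto the MixIT constraint set; the function name and the simple per-column argmax projection below are illustrative choices.

```python
import numpy as np

def efficient_mixit_assignment(est_sources, ref_mixtures):
    """Approximate the MixIT mixing matrix without enumerating all 2^M assignments.

    est_sources:  (M, T) array of M estimated source waveforms.
    ref_mixtures: (K, T) array of K reference mixtures (K = 2 in standard MixIT).

    Returns a binary (K, M) matrix with exactly one nonzero entry per column,
    i.e. each estimated source is assigned to a single reference mixture.
    """
    # Unconstrained least-squares fit: find A minimizing ||ref - A @ est||^2.
    # np.linalg.lstsq solves est_sources.T @ A.T ~= ref_mixtures.T for A.T.
    a_t, *_ = np.linalg.lstsq(est_sources.T, ref_mixtures.T, rcond=None)
    a = a_t.T  # shape (K, M), real-valued

    # Project onto the MixIT constraint set: keep, for each estimated source,
    # only the reference mixture with the largest least-squares coefficient.
    assignment = np.argmax(a, axis=0)                  # (M,)
    a_binary = np.zeros_like(a)
    a_binary[assignment, np.arange(a.shape[1])] = 1.0
    return a_binary

# Illustrative use: remix the estimates and score them against the references.
M, K, T = 8, 2, 16000
est = np.random.randn(M, T)
ref = np.random.randn(K, T)
A = efficient_mixit_assignment(est, ref)
remixed = A @ est  # (K, T) remixtures to compare against ref, e.g. via SI-SNR
```

This avoids the exponential search over binary assignments at the cost of an approximate projection step.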