Deep neural network based methods have been successfully applied to music source separation (MSS). They typically learn a mapping from a mixture spectrogram to a set of source spectrograms, using magnitudes only. This approach has several limitations: 1) the incorrect phase reconstruction degrades performance; 2) it limits the magnitude of masks to between 0 and 1, whereas we observe that 22% of time-frequency bins have ideal ratio mask values over 1 in a popular dataset, MUSDB18; 3) its potential on very deep architectures is under-explored. Our proposed system is designed to overcome these limitations. First, we propose to estimate phases by estimating complex ideal ratio masks (cIRMs), where we decouple the estimation of cIRMs into magnitude and phase estimations. Second, we extend the separation method to effectively allow the magnitude of the mask to be larger than 1. Finally, we propose a residual UNet architecture with up to 143 layers. Our proposed system achieves a state-of-the-art MSS result on the MUSDB18 dataset; in particular, an SDR of 8.98 dB on vocals, outperforming the previous best performance of 7.24 dB. The source code is available at: https://github.com/bytedance/music_source_separation
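The decoupling described above can be illustrated with a minimal sketch: a magnitude mask that is non-negative but not capped at 1 is combined with a unit-modulus phase estimate into a complex mask applied to the mixture STFT. The arrays standing in for network outputs here are random placeholders, not the paper's actual model predictions.

```python
import numpy as np

# Hypothetical sketch of a decoupled complex ideal ratio mask (cIRM):
# estimate a magnitude mask and a phase separately, then combine them.
rng = np.random.default_rng(0)
freq_bins, frames = 4, 5

# Mixture STFT (complex-valued), a stand-in for a real spectrogram.
mixture = (rng.standard_normal((freq_bins, frames))
           + 1j * rng.standard_normal((freq_bins, frames)))

# Placeholder "network outputs" for magnitude and phase branches.
mag_logits = rng.standard_normal((freq_bins, frames))
phase_real = rng.standard_normal((freq_bins, frames))
phase_imag = rng.standard_normal((freq_bins, frames))

# Magnitude mask via ReLU: non-negative and, crucially, not bounded by 1,
# so bins where the ideal ratio mask exceeds 1 can still be represented.
mag_mask = np.maximum(mag_logits, 0.0)

# Phase estimate as a unit-modulus complex number: predict a real/imaginary
# pair and normalize it, avoiding direct regression of a wrapped angle.
norm = np.sqrt(phase_real**2 + phase_imag**2) + 1e-8
phase = (phase_real + 1j * phase_imag) / norm

# Combine into a complex mask and apply it to the mixture STFT.
cirm = mag_mask * phase
separated = cirm * mixture
```

The key property is that `np.abs(cirm)` equals `mag_mask` exactly, so the magnitude and phase estimates remain independent, and nothing in the construction clips the mask magnitude at 1.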