Music source separation is the task of extracting the individual instruments from a given song. Recent breakthroughs on this challenge have centered on a single dataset, MUSDB, which is limited to only four instrument classes. Building larger datasets with more instruments is costly and time-consuming, both in collecting the data and in training deep neural networks (DNNs). In this work, we propose a fast method to evaluate the separability of instruments in any dataset without training or tuning a DNN. This separability measure helps select appropriate samples for the efficient training of neural networks. Based on the oracle principle with an ideal ratio mask, our approach is an excellent proxy for estimating the separation performance of state-of-the-art deep learning approaches based on Time-Frequency (TF) masking, such as TasNet or Open-Unmix. Our results indicate that even though TasNet is free to learn a latent space in which instruments could be separated efficiently by a masking process, this space is no better than the TF representation.
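The oracle principle mentioned above can be illustrated with a minimal sketch: given the ground-truth stems, an ideal ratio mask (IRM) is computed from their magnitude spectrograms and applied to the mixture's STFT, giving an upper bound on what TF masking can achieve. The function name `irm_oracle` and the STFT parameters below are illustrative choices, not from the paper; `scipy` is assumed to be available.

```python
import numpy as np
from scipy.signal import stft, istft

def irm_oracle(sources, fs=16000, nperseg=1024):
    """Oracle separation with an ideal ratio mask (IRM).

    sources: array of shape (n_sources, n_samples) holding the ground-truth stems.
    Returns time-domain estimates recovered by masking the mixture's STFT.
    """
    mixture = sources.sum(axis=0)
    _, _, X = stft(mixture, fs=fs, nperseg=nperseg)
    # Magnitude spectrogram of each stem.
    mags = np.stack([np.abs(stft(s, fs=fs, nperseg=nperseg)[2]) for s in sources])
    # IRM: each source's share of the total magnitude in every TF bin.
    masks = mags / (mags.sum(axis=0) + 1e-12)
    # Apply each mask to the mixture STFT and invert back to the time domain.
    estimates = np.stack([istft(m * X, fs=fs, nperseg=nperseg)[1] for m in masks])
    return estimates[:, : sources.shape[1]]

# Toy demo: two sinusoids occupying disjoint frequency bins separate almost perfectly.
fs = 16000
t = np.arange(fs) / fs
srcs = np.stack([np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 2000 * t)])
est = irm_oracle(srcs, fs=fs)
```

Because the masks are built from the true stems, the resulting scores bound any TF-masking separator from above, which is what makes this a cheap proxy for trained models.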