Supervised deep learning approaches to underdetermined audio source separation achieve state-of-the-art performance but require a dataset of mixtures along with their corresponding isolated source signals. Such datasets can be extremely costly to obtain for musical mixtures, which raises a need for unsupervised methods. We propose a novel unsupervised model-based deep learning approach to musical source separation. Each source is modelled with a differentiable parametric source-filter model. A neural network is trained to reconstruct the observed mixture as a sum of the sources by estimating each source model's parameters given the sources' fundamental frequencies. At test time, soft masks are obtained from the synthesized source signals. The experimental evaluation on a vocal ensemble separation task shows that the proposed method outperforms learning-free methods based on nonnegative matrix factorization as well as a supervised deep learning baseline. Integrating domain knowledge in the form of source models into a data-driven method leads to high data efficiency: the proposed approach achieves good separation quality even when trained on less than three minutes of audio. This work makes powerful deep-learning-based separation usable in scenarios where training data with ground truth is expensive or nonexistent.
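To make the synthesis-and-sum idea concrete, the following is a minimal numpy sketch. Each source is rendered from its fundamental frequency trajectory by a simple harmonic model whose per-harmonic amplitudes act as a crude spectral filter, and the training objective would compare the sum of the synthesized sources to the observed mixture. The harmonic parameterization and all function names are illustrative assumptions, not the paper's exact source-filter model.

```python
import numpy as np

def synthesize_source(f0, harmonic_amps, fs=16000):
    """Render one source from its f0 trajectory (illustrative sketch).

    f0            : (n_samples,) fundamental frequency in Hz, at audio rate
    harmonic_amps : (n_harmonics, n_samples) per-harmonic amplitude envelopes,
                    standing in for the filter part of a source-filter model
    """
    phase = 2 * np.pi * np.cumsum(f0) / fs          # instantaneous phase of the fundamental
    signal = np.zeros(len(f0))
    n_harmonics = harmonic_amps.shape[0]
    for k in range(1, n_harmonics + 1):
        alias_free = (k * f0) < (fs / 2)            # silence harmonics above Nyquist
        signal += alias_free * harmonic_amps[k - 1] * np.sin(k * phase)
    return signal

def reconstruct_mixture(sources):
    # training target: the sum of the synthesized sources should match the mixture
    return np.sum(sources, axis=0)
```

In the actual method, the amplitudes (and any other model parameters) would be predicted by the neural network given the f0 trajectories; since every operation above is differentiable, a reconstruction loss on the mixture can be backpropagated through the synthesis.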
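The test-time masking step admits an equally compact sketch. Assuming a magnitude-ratio (Wiener-like) soft mask, which is one common choice and not necessarily the paper's exact definition, the synthesized source signals supply magnitude estimates and the mixture's STFT is redistributed among the sources:

```python
import numpy as np
from scipy.signal import stft, istft

def soft_mask_separation(mixture, synthesized_sources, fs=16000, nperseg=1024):
    """Separate a mixture using ratio masks built from synthesized sources."""
    _, _, X = stft(mixture, fs, nperseg=nperseg)    # mixture STFT
    mags = [np.abs(stft(s, fs, nperseg=nperseg)[2]) for s in synthesized_sources]
    denom = np.sum(mags, axis=0) + 1e-8             # avoid division by zero
    estimates = []
    for m in mags:
        mask = m / denom                            # soft mask in [0, 1]
        _, s_hat = istft(mask * X, fs, nperseg=nperseg)
        estimates.append(s_hat)                     # masked mixture, back in time domain
    return estimates
```

Because the masks sum to one in every time-frequency bin, the source estimates sum back to approximately the mixture, keeping the separation consistent with the reconstruction objective used during training.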