We propose a method for the blind separation of sounds of musical instruments in audio signals. We describe the individual tones via a parametric model, training a dictionary to capture the relative amplitudes of the harmonics. The model parameters are predicted via a U-Net, which is a type of deep neural network. The network is trained without ground truth information, based on the difference between the model prediction and the individual STFT time frames. Since some of the model parameters do not yield a useful backpropagation gradient, we model them stochastically and employ the policy gradient instead. To provide phase information and account for inaccuracies in the dictionary-based representation, we also let the network output a direct prediction, which we then use to resynthesize the audio signals for the individual instruments. Due to the flexibility of the neural network, inharmonicity can be incorporated seamlessly and no preprocessing of the input spectra is required. Our algorithm yields high-quality separation results with particularly low interference on a variety of audio samples, both acoustic and synthetic, provided that the sample contains enough data for training and that the spectral characteristics of the musical instruments are sufficiently stable to be approximated by the dictionary.
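For illustration, a minimal sketch of such a parametric model (with notation assumed here, not necessarily the exact form used) represents the magnitude spectrum of a single tone in one STFT frame as a dictionary-weighted sum of harmonic peaks,
\[
\hat{X}(f) \;=\; a \sum_{h=1}^{H} d_{i,h}\, g\!\bigl(f - h f_0 \sqrt{1 + \beta h^2}\bigr),
\]
where $a$ denotes the tone amplitude, $f_0$ the fundamental frequency, $\beta$ an inharmonicity coefficient, $d_{i,h}$ the trained dictionary entry giving the relative amplitude of the $h$-th harmonic of instrument $i$, and $g$ a fixed spectral peak shape. In this sketch, the U-Net would predict $a$, $f_0$, and $\beta$ from the observed frame, while discrete quantities such as the instrument assignment, which do not admit a useful backpropagation gradient, would be sampled and trained via the policy gradient.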