Deep learning-based approaches to musical source separation are often limited to the instrument classes the models are trained on and do not generalize to separating unseen instruments. To address this, we propose a few-shot musical source separation paradigm. We condition a generic U-Net source separation model using a few audio examples of the target instrument. We train a few-shot conditioning encoder jointly with the U-Net to encode the audio examples into a conditioning vector that configures the U-Net via feature-wise linear modulation (FiLM). We evaluate the trained models on real musical recordings from the MUSDB18 and MedleyDB datasets, and show that our proposed few-shot conditioning paradigm outperforms the baseline one-hot instrument-class-conditioned model for both seen and unseen instruments. To extend the scope of our approach to a wider variety of real-world scenarios, we also experiment with different characteristics of the conditioning examples, including examples drawn from different recordings, examples containing multiple sources, and negative conditioning examples.
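The core mechanism described above can be illustrated compactly. The sketch below, a minimal NumPy toy and not the paper's implementation, shows FiLM conditioning: a conditioning vector (here a simple average of stand-in encoder embeddings of the few-shot examples) is projected to per-channel scale (gamma) and shift (beta) parameters that modulate a feature map, as a FiLM layer inside a U-Net would. All dimensions, the averaging aggregation, and the linear projections are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def film(features, gamma, beta):
    """Feature-wise linear modulation: scale and shift each channel.

    features: (channels, time) feature map; gamma, beta: (channels,) vectors.
    """
    return gamma[:, None] * features + beta[:, None]

# Hypothetical few-shot conditioning: K example embeddings from an encoder
# (random stand-ins here), aggregated into a single conditioning vector.
K, embed_dim, channels, time = 3, 8, 4, 16
example_embeddings = rng.normal(size=(K, embed_dim))
conditioning = example_embeddings.mean(axis=0)

# Assumed linear projections from the conditioning vector to per-channel
# gamma/beta; in a trained model these weights would be learned jointly
# with the separation network.
W_gamma = rng.normal(size=(channels, embed_dim))
W_beta = rng.normal(size=(channels, embed_dim))
gamma = W_gamma @ conditioning
beta = W_beta @ conditioning

features = rng.normal(size=(channels, time))  # one U-Net feature map slice
modulated = film(features, gamma, beta)
print(modulated.shape)  # (4, 16)
```

Because gamma and beta are functions of the conditioning vector alone, the same separation network can be steered toward different target instruments simply by changing the few-shot examples fed to the encoder.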