With the recent advancement of data-driven approaches using deep neural networks, music source separation has been formulated as an instrument-specific supervised problem. While existing deep learning models implicitly absorb the spatial information conveyed by multi-channel input signals, we argue that a more explicit and active use of spatial information could not only improve the separation process but also provide an entry point for many user-interaction-based tools. To this end, we introduce a control method based on the stereophonic location of the sources of interest, expressed as the panning angle. We present various conditioning mechanisms, including the use of the raw angle and its derived feature representations, and show that spatial information helps. Our proposed approaches improve separation performance over location-agnostic architectures by 1.8 dB SI-SDR in our Slakh-based simulated experiments. Furthermore, the proposed methods allow for the disentanglement of same-class instruments, for example, in mixtures containing two guitar tracks. Finally, we demonstrate that our approach is robust to incorrect source panning information, which can arise from the proposed user interaction.
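To make the conditioning signal concrete: under a constant-power panning model, a mono source s panned at angle θ appears in the stereo mix as L = cos(θ)·s and R = sin(θ)·s, so θ can be recovered from the ratio of channel energies. The sketch below is a minimal, hypothetical illustration of this idea (the function name, the constant-power assumption, and the angle convention of 0° = hard left, 45° = center, 90° = hard right are ours, not the paper's exact feature extraction):

```python
import numpy as np

def estimate_panning_angle(stereo, eps=1e-8):
    """Estimate the panning angle (degrees) of a stereo signal of
    shape (2, num_samples), assuming constant-power panning:
    L = cos(theta) * s, R = sin(theta) * s, theta in [0, 90].
    Hypothetical helper for illustration only."""
    left, right = stereo[0], stereo[1]
    # Per-channel RMS energy; eps guards against silent channels.
    e_left = np.sqrt(np.mean(left ** 2)) + eps
    e_right = np.sqrt(np.mean(right ** 2)) + eps
    # arctan2 recovers theta from the energy ratio.
    return np.degrees(np.arctan2(e_right, e_left))

# Example: a mono source panned at 30 degrees.
rng = np.random.default_rng(0)
source = rng.standard_normal(44100)
theta = np.radians(30.0)
mix = np.stack([np.cos(theta) * source, np.sin(theta) * source])
print(round(float(estimate_panning_angle(mix)), 1))  # prints 30.0
```

In a mixture of several sources this simple global estimate breaks down, which is one motivation for letting the user supply (possibly imprecise) panning information as a conditioning input instead.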