We propose a unified model for three inter-related tasks: 1) to \textit{separate} individual sound sources from mixed music audio, 2) to \textit{transcribe} each sound source to MIDI notes, and 3) to \textit{synthesize} new pieces based on the timbre of separated sources. The model is inspired by the fact that when humans listen to music, our minds can not only separate the sounds of different instruments but also, at the same time, perceive high-level representations such as score and timbre. To mirror this capability computationally, we design a pitch-timbre disentanglement module based on a popular encoder-decoder neural architecture for source separation. The key inductive biases are vector quantization for the pitch representation and pitch-transformation invariance for the timbre representation. In addition, we adopt a query-by-example method to achieve \textit{zero-shot} learning, i.e., the model can perform source separation, transcription, and synthesis for \textit{unseen} instruments. The current design focuses on audio mixtures of two monophonic instruments. Experimental results show that our model outperforms existing multi-task baselines, and the transcribed score serves as a powerful auxiliary for separation tasks.
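To make the two inductive biases concrete, the sketch below shows one minimal way they can be instantiated in PyTorch. It is illustrative only: the module names, layer dimensions, and the invariance loss are placeholder assumptions, not the exact architecture described in this paper. A vector-quantized encoder discretizes frame-wise pitch, while a pooled timbre encoder is pushed toward pitch-transformation invariance by matching embeddings across pitch-shifted copies of the same clip.

\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQPitchEncoder(nn.Module):
    # Encode spectrogram frames into discrete pitch codes via vector
    # quantization with a straight-through gradient estimator.
    # (All sizes below are illustrative placeholders.)
    def __init__(self, in_dim=513, hid=256, num_codes=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hid), nn.ReLU(), nn.Linear(hid, hid))
        self.codebook = nn.Embedding(num_codes, hid)

    def forward(self, x):  # x: (batch, time, in_dim)
        z = self.net(x)    # continuous frame-wise latents
        # distance of each frame latent to every codebook entry
        dist = torch.cdist(z, self.codebook.weight.unsqueeze(0))
        idx = dist.argmin(dim=-1)      # one discrete pitch code per frame
        q = self.codebook(idx)
        # straight-through: forward pass uses the quantized code,
        # backward pass routes gradients to the continuous encoder
        q = z + (q - z).detach()
        return q, idx

class TimbreEncoder(nn.Module):
    # Map a query clip to a single timbre embedding by mean-pooling
    # frame-wise features over time.
    def __init__(self, in_dim=513, hid=256, emb=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hid), nn.ReLU(), nn.Linear(hid, emb))

    def forward(self, x):  # x: (batch, time, in_dim)
        return self.net(x).mean(dim=1)  # (batch, emb)

def pitch_invariance_loss(enc, x, x_shifted):
    # Encourage pitch-transformation invariance: a clip and its
    # pitch-shifted copy should map to the same timbre embedding.
    return F.mse_loss(enc(x), enc(x_shifted))

# Smoke test with random stand-ins for spectrogram input.
mix = torch.randn(4, 100, 513)
pitch_codes, idx = VQPitchEncoder()(mix)
timbre = TimbreEncoder()(mix)
print(pitch_codes.shape, idx.shape, timbre.shape)
\end{verbatim}

The straight-through estimator lets gradients bypass the non-differentiable nearest-neighbor lookup, which is what allows the discrete pitch codes to be trained end-to-end alongside the rest of the separation network.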